
Web Semantics: Science, Services and Agents on the World Wide Web 10 (2012) 76–110


Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

Aidan Hogan a, Antoine Zimmermann b, Jürgen Umbrich a, Axel Polleres c, Stefan Decker a

a Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
b INSA-Lyon, LIRIS, UMR5205, Villeurbanne F-69621, France
c Siemens AG Österreich, Siemensstrasse 90, 1210 Vienna, Austria

Article history: Available online 18 November 2011

Keywords: Entity consolidation; Web data; Linked Data; RDF


Abstract: With respect to large-scale, static, Linked Data corpora, in this paper we discuss scalable and distributed methods for entity consolidation (aka. smushing, entity resolution, object consolidation, etc.) to locate and process names that signify the same entity. We investigate (i) a baseline approach, which uses explicit owl:sameAs relations to perform consolidation; (ii) extended entity consolidation, which additionally uses a subset of OWL 2 RL/RDF rules to derive novel owl:sameAs relations through the semantics of inverse-functional properties, functional properties and (max-)cardinality restrictions with value one; (iii) deriving weighted concurrence measures between entities in the corpus based on shared inlinks/outlinks and attribute values using statistical analyses; (iv) disambiguating (initially) consolidated entities based on inconsistency detection using OWL 2 RL/RDF rules. Our methods are based upon distributed sorts and scans of the corpus, where we deliberately avoid the requirement for indexing all data. Throughout, we offer evaluation over a diverse Linked Data corpus consisting of 1.118 billion quadruples derived from a domain-agnostic, open crawl of 3.985 million RDF/XML Web documents, demonstrating the feasibility of our methods at that scale, and giving insights into the quality of the results for real-world data.


1. Introduction

Over a decade since the dawn of the Semantic Web, RDF publishing has begun to find some traction through adoption of Linked Data best practices, as follows:

(i) use URIs as names for things (and not just documents);
(ii) make those URIs dereferenceable via HTTP;
(iii) return useful and relevant RDF content upon lookup of those URIs;
(iv) include links to other datasets.

The Linked Open Data project has advocated the goal of providing dereferenceable machine-readable data in a common format (RDF), with emphasis on the re-use of URIs and inter-linkage between remote datasets—in so doing, the project has overseen exports from corporate entities (e.g., the BBC, BestBuy, Freebase), governmental bodies (e.g., the UK Government, the US government), existing structured datasets (e.g., DBPedia), social networking sites (e.g., flickr, Twitter, livejournal), academic communities (e.g., DBLP, UniProt), as well as esoteric exports (e.g., Linked Open Numbers, Poképédia). This burgeoning web of structured data has succinctly been dubbed the "Web of Data".

Considering the merge of these structured exports, at a conservative estimate there now exists somewhere in the order of thirty billion RDF triples published on the Web as Linked Data.1 However, in this respect, size is not everything [73]. In particular, although the situation is improving, individual datasets are still not well-interlinked (cf. [72])—without sufficient linkage, the ideal of a "Web of Data" quickly disintegrates into the current reality of "Archipelagos of Datasets".

1 http://www4.wiwiss.fu-berlin.de/lodcloud/state/

There have been numerous works that have looked at bridging the archipelagos. Some works aim at aligning a small number of related datasets (e.g., [48,58,49]), thus focusing more on theoretical considerations than scalability, usually combining symbolic (e.g., reasoning with consistency checking) methods and similarity measures. Some authors have looked at inter-linkage of domain-specific RDF datasets at various degrees of scale (e.g., [61,59,38,54,46]). Further research has also looked at exploiting shared terminological data—as well as explicitly asserted links—to better integrate Linked Data collected from thousands or millions of sources (e.g., [30,50,15,35]); the work presented herein falls most closely into this category. One approach has tackled the problem from the publishers' side, detailing a system for manually specifying some (possibly heuristic) criteria for creating links between two datasets [72]. We leave further detailed related work to Section 9.

In this paper we look at methods to provide better linkage between resources, in particular focusing on finding equivalent entities in the data. Our notion of an entity is a representation of something being described by the data, e.g., a person, a place, a musician, a protein, etc. We say that two entities are equivalent if they are coreferent, e.g., refer to the same person, place, etc.2 Given a collection of datasets that speak about the same referents using different identifiers, we wish to identify these coreferences and somehow merge the knowledge contribution provided by the distinct parties. We call this merge consolidation.

2 Herein we avoid philosophical discussion on the notion of identity; for interesting discussion thereon, see [26].

In particular, our work is inspired by the requirements of the Semantic Web Search Engine (SWSE) project [32], within which we aim to offer search and browsing over large, static Linked Data corpora crawled from the Web.3 The core operation of SWSE is to take user keyword queries as input and to generate a ranked list of matching entities as results. After the core components of a crawler, index and user-interface, we saw a clear need for a component that consolidates—by means of identifying and canonicalising equivalent identifiers—the indexed corpus: there was an observable lack of URIs, such that coreferent blank-nodes were prevalent [30], even within the same dataset, and thus we observed many duplicate results referring to the same thing, leading to poor integration of data from our source documents.

3 By static, we mean that the system does not cater for updates; this omission allows for optimisations throughout the system. Instead, we aim at a cyclical indexing paradigm where new indexes are bulk-loaded in the background on separate machines.

To take a brief example, consider a simple example query: "WHO DOES TIM BERNERS-LEE KNOW?". Knowing that Tim uses the URI timblfoaf:i to refer to himself in his personal FOAF profile document, and again knowing that the property foaf:knows relates people to their (reciprocated) acquaintances, we can formulate this request as the SPARQL query [53] as follows:

SELECT ?person
WHERE { timblfoaf:i foaf:knows ?person }

However, other publishers use different URIs to identify Tim, where to get more complete answers across these naming schemes, the SPARQL query must use disjunctive UNION clauses for each known URI; here we give an example using a sample of identifiers extracted from a real Linked Data corpus (introduced later):


SELECT ?person
WHERE {
    { timblfoaf:i foaf:knows ?person }
  UNION { identica:u45563 foaf:knows ?person }
  UNION { dbpedia:Berners-Lee foaf:knows ?person }
  UNION { dbpedia:Dr_Tim_Berners-Lee foaf:knows ?person }
  UNION { dbpedia:Tim-Berners_Lee foaf:knows ?person }
  UNION { dbpedia:TimBL foaf:knows ?person }
  UNION { dbpedia:Tim_Berners-Lee foaf:knows ?person }
  UNION { dbpedia:Tim_berners-lee foaf:knows ?person }
  UNION { dbpedia:Timbl foaf:knows ?person }
  UNION { dbpedia:Timothy_Berners-Lee foaf:knows ?person }
  UNION { yagor:Tim_Berners-Lee foaf:knows ?person }
  UNION { fben:tim_berners-lee foaf:knows ?person }
  UNION { swid:Tim_Berners-Lee foaf:knows ?person }
  UNION { dblpperson:100007 foaf:knows ?person }
  UNION { avtimbl:me foaf:knows ?person }
  UNION { bmpersons:Tim+Berners-Lee foaf:knows ?person }
}

We see disparate URIs not only across data publishers, but also within the same namespace. Clearly, the expanded query quickly becomes extremely cumbersome.

In this paper, we look at bespoke methods for identifying and processing coreference in a manner such that the resultant corpus can be consumed as if more complete agreement on URIs was present; in other words, using standard query-answering techniques, we want the enhanced corpus to return the same answers for the original simple query as for the latter expanded query.

Our core requirements for the consolidation component are as follows:

– the component must give high precision of consolidated results;
– the underlying algorithm(s) must be scalable;
– the approach must be fully automatic;
– the methods must be domain agnostic;

where a component with poor precision will lead to garbled final results, merging unrelated entities; where scalability is required to apply the process over our corpora, typically in the order of a billion statements (and which we feasibly hope to expand in future); where the scale of the corpora under analysis precludes any manual intervention; and where—for the purposes of research—the methods should not give preferential treatment to any domain or vocabulary of data (other than core RDF(S)/OWL terms). Alongside these primary requirements, we also identify the following secondary criteria:

– the analysis should demonstrate high recall;
– the underlying algorithm(s) should be efficient;

where the consolidation component should identify as many (correct) equivalences as possible, and where the algorithm should be applicable in reasonable time. Clearly, the secondary requirements are also important, but they are superseded by those given earlier: where a certain trade-off exists, we prefer a system that gives a high percentage of correct results and leads to a clean consolidated corpus, over an approach that gives a higher percentage of consolidated results but leads to a partially garbled corpus; similarly, we prefer a system that can handle more data (is more scalable) but may possibly have a lower throughput (is less efficient).4

Thus, herein, we revisit methods for scalable, precise, automatic and domain-agnostic entity consolidation over large, static Linked Data corpora. In order to make our methods scalable, we avoid dynamic on-disk index structures and instead opt for algorithms that rely on sequential on-disk reads/writes of compressed flat files, using operations such as scans, external sorts, merge-joins, and only light-weight or non-critical in-memory indices. In order to make our methods efficient, we demonstrate distributed implementations of our methods over a cluster of shared-nothing commodity hardware, where our algorithms attempt to maximise the portion of time spent in embarrassingly parallel execution—i.e., parallel, independent computation without need for inter-machine coordination. In order to make our methods domain-agnostic and fully-automatic, we exploit the generic formal semantics of the data described in RDF(S)/OWL, and also generic statistics derivable from the corpus. In order to achieve high recall, we incorporate additional OWL features to find novel coreference through reasoning. Aiming at high precision, we introduce methods that again exploit the semantics and also the statistics of the data, but to conversely disambiguate entities: to defeat equivalences found in the previous step that are unlikely to be true according to some criteria.

As such, extending upon various reasoning and consolidation techniques described in previous works [30,31,33,34], we now give a self-contained treatment of our results in this area. In addition, we distribute the execution of all methods over a cluster of commodity hardware, we provide novel, detailed performance and quality evaluation over a large, real-world Linked Data corpus of one billion statements, and we also examine exploratory techniques for interlinking similar entities based on statistical analyses, as well as methods for disambiguating and repairing incorrect coreferences.

In summary, in this paper we:

– provide some necessary preliminaries and describe our distributed architecture (Section 2);
– characterise the 1 billion quadruple Linked Data corpus that will be used for later evaluation of our methods, particularly focusing on the (re-)use of data-level identifiers in the corpus (Section 3);
– describe and evaluate our distributed base-line approach for consolidation, which leverages explicit owl:sameAs relations (Section 4);
– describe and evaluate a distributed approach that extends consolidation to consider a richer OWL semantics for consolidating (Section 5);
– present a distributed algorithm for determining weighted concurrence between entities using statistical analysis of predicates in the corpus (Section 6);
– present a distributed approach to disambiguate entities—i.e., detect likely erroneous consolidation—combining the semantics and statistics derivable from the corpus (Section 7);
– provide critical discussion (Section 8), render related work (Section 9), and conclude with discussion (Section 10).

4 Of course, entity consolidation has many practical applications outside of our referential use-case SWSE, and is useful in many generic query-answering scenarios—we see these requirements as being somewhat fundamental to a consolidation component, to varying degrees.

2. Preliminaries

In this section, we provide some necessary preliminaries relating to: (i) RDF, the structured format used in our corpora (Section 2.1); (ii) RDFS/OWL, which provides formal semantics to RDF data, including the semantics of equality (Section 2.2); (iii) rules, which are used to interpret the semantics of RDF and apply reasoning (Sections 2.3, 2.4); (iv) authoritative reasoning, which critically examines the source of certain types of Linked Data to help ensure robustness of reasoning (Section 2.5); (v) and OWL 2 RL/RDF rules, a standard rule-based profile for supporting OWL semantics, a subset of which we use to deduce equality (Section 2.6). Throughout, we attempt to preserve notation and terminology as prevalent in the literature.

2.1. RDF

We briefly give some necessary notation relating to RDF constants and RDF triples; see [28].

RDF constant. Given the set of URI references U, the set of blank nodes B, and the set of literals L, the set of RDF constants is denoted by C = U ∪ B ∪ L. As opposed to the RDF-mandated existential semantics for blank-nodes, we interpret blank-nodes as ground Skolem constants; note also that we rewrite blank-node labels to ensure uniqueness per document, as per RDF merging [28].

RDF triple. A triple t = (s, p, o) ∈ 𝒢 = (U ∪ B) × U × C is called an RDF triple, where s is called subject, p predicate and o object. We call a finite set of triples G ⊆ 𝒢 a graph. This notion of a triple restricts the positions in which certain terms may appear. We use the phrase generalised triple where such restrictions are relaxed and where literals and blank-nodes are allowed to appear in the subject/predicate positions.

RDF triple in context/quadruple. Given c ∈ U, let http(c) denote the possibly empty graph Gc given by retrieving the URI c through HTTP (directly returning 200 Okay). An ordered pair (t, c) with an RDF triple t = (s, p, o), c ∈ U and t ∈ http(c) is called a triple in context c. We may also refer to (s, p, o, c) as an RDF quadruple or quad q with context c.

2.2. RDFS, OWL and owl:sameAs

Conceptually, RDF data is composed of assertional data (i.e., instance data) and terminological data (i.e., schema data). Assertional data define relationships between individuals, provide literal-valued attributes of those individuals, and declare individuals as members of classes. Thereafter, terminological data describe those classes, relationships (object properties) and attributes (datatype properties), and declaratively assign them a semantics.5 With well-defined descriptions of classes and properties, reasoning can then allow for deriving new knowledge, including over assertional data.

On the Semantic Web, the RDFS [11] and OWL [42] standards are prominently used for making statements with well-defined meaning and enabling such reasoning. Although primarily concerned with describing terminological knowledge—such as subsumption or equivalence between classes and properties, etc.—RDFS and OWL also contain features that operate on an assertional level. The most interesting such feature for our purposes is owl:sameAs: a core OWL property that allows for defining equivalences between individuals (as well as classes and properties). Two individuals related by means of owl:sameAs are interpreted as referring to the same entity, i.e., they are coreferent.

5 Terminological and assertional data are not by any means necessarily disjoint. Furthermore, this distinction is not always required, but is useful for our purposes herein.


Interestingly, OWL also contains other features (on a terminological level) that allow for inferring new owl:sameAs relations. Such features include:

– inverse-functional properties, which act as "key properties" for uniquely identifying subjects, e.g., an isbn datatype-property whose values uniquely identify books and other documents, or the object-property biologicalMotherOf, which uniquely identifies the biological mother of a particular child;
– functional properties, which work in the inverse direction and uniquely identify objects, e.g., the property hasBiologicalMother;
– cardinality and max-cardinality restrictions, which, when given a value of 1, uniquely identify subjects of a given class, e.g., the value for the property spouse uniquely identifies a member of the class Monogamist.

Thus, we see that RDFS and OWL contain various features that allow for inferring and supporting coreference between individuals. In order to support the semantics of such features, we apply inference rules, introduced next.

2.3. Inference rules

Herein, we formalise the notion of an inference rule as applicable over RDF graphs [33]. We begin by defining the preliminary concept of a triple pattern, of which rules are composed, and which may contain variables in any position.

Triple pattern, basic graph pattern. A triple pattern is a generalised triple where variables from the set V are allowed, i.e., tv = (sv, pv, ov) ∈ 𝒢V, 𝒢V = (C ∪ V) × (C ∪ V) × (C ∪ V). We call a set (to be read as conjunction) of triple patterns GV ⊆ 𝒢V a basic graph pattern. We denote the set of variables in graph pattern GV by V(GV).

Intuitively, variables appearing in triple patterns can be bound by any RDF term in C. Such a mapping from variables to constants is called a variable binding.

Variable bindings. Let Ω be the set of variable bindings V ∪ C → V ∪ C that map every constant c ∈ C to itself and every variable v ∈ V to an element of the set C ∪ V. A triple t is a binding of a triple pattern tv = (sv, pv, ov) iff there exists μ ∈ Ω such that t = μ(tv) = (μ(sv), μ(pv), μ(ov)). A graph G is a binding of a graph pattern GV iff there exists a mapping μ ∈ Ω such that ⋃tv∈GV μ(tv) = G; we use the shorthand μ(GV) = G. We use Ω(GV, G) = {μ | μ(GV) ⊆ G, μ(v) = v if v ∉ V(GV)} to denote the set of variable binding mappings for graph pattern GV in graph G that map variables outside GV to themselves.

We can now give the notion of an inference rule, which is comprised of a pair of basic graph patterns that form an "IF ⇒ THEN" logical structure.

Inference rule. We define an inference rule (or often just rule) as a pair r = (Ante_r, Con_r), where the antecedent (or body) Ante_r ⊆ 𝒢V and the consequent (or head) Con_r ⊆ 𝒢V are basic graph patterns such that all variables in the consequent appear in the antecedent: V(Con_r) ⊆ V(Ante_r). We write inference rules as Ante_r ⇒ Con_r.

Finally, rules allow for applying inference through rule applications, where the rule body is used to generate variable bindings against a given RDF graph, and where those bindings are applied on the head to generate new triples that the graph entails.

Rule application and standard closure. A rule application is the immediate consequences T_r(G) = ⋃_{μ∈Ω(Ante_r, G)} (μ(Con_r) \ μ(Ante_r)) of a rule r on a graph G; accordingly, for a ruleset R, T_R(G) = ⋃_{r∈R} T_r(G). Now let G_{i+1} = G_i ∪ T_R(G_i) and G_0 = G; the exhaustive application of the T_R operator on a graph G is then the least fixpoint (the smallest value for n) such that G_n = T_R(G_n). We call G_n the closure of G w.r.t. ruleset R, denoted as Cl_R(G), or succinctly Cl(G) where the ruleset is obvious.

The above closure takes a graph and a ruleset and recursively applies the rules over the union of the original graph and the inferences until a fixpoint.
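To make the fixpoint computation concrete, the following is a minimal Java sketch of the closure operation just described (Java being the language of our implementation); the Triple record and Rule interface are hypothetical simplifications introduced for illustration, not the interfaces of the actual system.

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the closure Cl_R(G): repeatedly apply every rule in the
// ruleset to the growing graph until no new triples are produced (fixpoint).
// "Triple" and "Rule" are hypothetical placeholder types for illustration.
final class NaiveClosure {
  record Triple(String s, String p, String o) { }

  interface Rule {
    // Returns the immediate consequences T_r(G) of this rule over graph G.
    Set<Triple> apply(Set<Triple> graph);
  }

  static Set<Triple> closure(Set<Triple> graph, List<Rule> ruleset) {
    Set<Triple> closed = new HashSet<>(graph);   // G_0 = G
    boolean changed = true;
    while (changed) {                            // iterate until G_n = T_R(G_n)
      changed = false;
      for (Rule r : ruleset) {
        if (closed.addAll(r.apply(closed))) {    // G_{i+1} = G_i ∪ T_R(G_i)
          changed = true;
        }
      }
    }
    return closed;
  }
}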

2.4. T-split rules

In [31,33] we formalised an optimisation based on a separation of terminological knowledge from assertional data when applying rules. Similar optimisations have also been used by other authors to enable large-scale rule-based inferencing [74,71,43]. This optimisation is based on the premise that in Linked Data (and various other scenarios), terminological data speaking about classes and properties is much smaller than assertional data speaking about individuals. Previous observations indicate that, for a Linked Data corpus in the order of a billion triples, such terminological data comprises about 0.1% of the total data volume. Additionally, terminological data is very frequently accessed during reasoning. Hence, separating and storing the terminological data in memory allows for more efficient execution of rules, albeit at the cost of possible incompleteness [74,33].

We now reintroduce some of the concepts relating to separating terminological information [33], which we will use later when applying rule-based reasoning to support the semantics of OWL equality. We begin by formalising some related concepts that help define our notion of terminological information.

Meta-class. We consider a meta-class as a class whose members are themselves (always) classes or properties. Herein, we restrict our notion of meta-classes to the set defined in the RDF(S) and OWL specifications, where examples include rdf:Property, rdfs:Class, owl:DatatypeProperty, owl:FunctionalProperty, etc. Note that, e.g., owl:Thing and rdfs:Resource are not meta-classes, since not all of their members are classes or properties.

Meta-property. A meta-property is one that has a meta-class as its domain; again, we restrict our notion of meta-properties to the set defined in the RDF(S) and OWL specifications, where examples include rdfs:domain, rdfs:range, rdfs:subClassOf, owl:hasKey, owl:inverseOf, owl:oneOf, owl:onProperty, owl:unionOf, etc. Note that rdf:type, owl:sameAs, rdfs:label, e.g., do not have a meta-class as domain and so are not considered as meta-properties.

Terminological triple. We define the set of terminological triples T ⊆ G as the union of:

(i) triples with rdf:type as predicate and a meta-class as object;
(ii) triples with a meta-property as predicate;
(iii) triples forming a valid RDF list whose head is the object of a meta-property (e.g., a list used for owl:unionOf, etc.).

The above concepts cover what we mean by terminological data in the context of RDFS and OWL. Next, we define the notion of triple patterns that distinguish between these different categories of data.

Terminological/assertional pattern. We refer to a terminological-triple/-graph pattern as one whose instance can only be a terminological triple or, resp., a set thereof. An assertional pattern is any pattern that is not terminological.
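As a simple illustration of the distinction, the following Java sketch checks cases (i) and (ii) of the terminological-triple definition; the meta-class and meta-property sets are abbreviated here for illustration (the full sets cover the RDF(S) and OWL vocabularies), and the RDF-list case (iii) is omitted.

import java.util.Set;

// Sketch of a terminological-triple check per cases (i) and (ii) above;
// the sets below are abbreviated examples, not the complete vocabularies.
final class Terminological {
  static final String RDF_TYPE = "rdf:type";
  static final Set<String> META_CLASSES = Set.of(
      "rdf:Property", "rdfs:Class", "owl:DatatypeProperty", "owl:FunctionalProperty");
  static final Set<String> META_PROPERTIES = Set.of(
      "rdfs:domain", "rdfs:range", "rdfs:subClassOf", "owl:inverseOf", "owl:onProperty");

  static boolean isTerminological(String s, String p, String o) {
    return (RDF_TYPE.equals(p) && META_CLASSES.contains(o))   // (i) rdf:type with meta-class object
        || META_PROPERTIES.contains(p);                       // (ii) meta-property as predicate
  }
}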

Given the above notions of terminological data/patterns, we can now define the notion of a T-split inference rule that distinguishes terminological from assertional information.

T-split inference rule. Given a rule r = (Ante_r, Con_r), we define a T-split rule r_s as the triple (AnteT_rs, AnteG_rs, Con_r), where AnteT_rs is the set of terminological patterns in Ante_r and AnteG_rs = Ante_r \ AnteT_rs.

The body of T-split rules is divided into a set of patterns that apply over terminological knowledge and a set of patterns that apply over assertional (i.e., any) knowledge. Such rules enable an optimisation whereby terminological patterns are pre-bound to generate a new, larger set of purely assertional rules, called T-ground rules. We illustrate this with an example.

Example. Let R_prp-ifp denote the following rule:

(?p, a, owl:InverseFunctionalProperty) (?x1, ?p, ?y) (?x2, ?p, ?y) ⇒ (?x1, owl:sameAs, ?x2)

When writing T-split rules, we denote terminological patterns in the body by underlining. Also, we use 'a' as a convenient shortcut for rdf:type, which indicates class-membership. Now take the terminological triples:

(isbn, a, owl:InverseFunctionalProperty)
(mbox, a, owl:InverseFunctionalProperty)

From these two triples, we can generate two T-ground rules as follows:

(?x1, isbn, ?y) (?x2, isbn, ?y) ⇒ (?x1, owl:sameAs, ?x2)
(?x1, mbox, ?y) (?x2, mbox, ?y) ⇒ (?x1, owl:sameAs, ?x2)

These T-ground rules encode the terminological OWL knowledge given by the previous two triples, and can be applied directly over the assertional data, e.g., describing specific books with ISBN numbers.
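The following Java sketch illustrates, under simplifying assumptions, how such T-ground rules can be produced and applied for the prp-ifp case: each property declared inverse-functional in the terminological data yields one purely assertional rule. The Triple and IfpRule types are hypothetical placeholders, not the actual SAOR implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch of T-grounding prp-ifp: bind the terminological pattern once, then
// apply the resulting assertional-only rules over the instance data.
final class TGrounding {
  record Triple(String s, String p, String o) { }

  // A T-ground prp-ifp rule is fully determined by the bound property.
  record IfpRule(String property) {
    // (?x1 p ?y), (?x2 p ?y) => (?x1 owl:sameAs ?x2)
    List<Triple> apply(Set<Triple> assertional) {
      List<Triple> sameAs = new ArrayList<>();
      for (Triple t1 : assertional)
        for (Triple t2 : assertional)
          if (t1.p().equals(property) && t2.p().equals(property)
              && t1.o().equals(t2.o()) && !t1.s().equals(t2.s()))
            sameAs.add(new Triple(t1.s(), "owl:sameAs", t2.s()));
      return sameAs;
    }
  }

  static List<IfpRule> ground(Set<Triple> terminological) {
    List<IfpRule> rules = new ArrayList<>();
    for (Triple t : terminological)
      if ("rdf:type".equals(t.p()) && "owl:InverseFunctionalProperty".equals(t.o()))
        rules.add(new IfpRule(t.s()));   // e.g., isbn, mbox
    return rules;
  }
}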

2.5. Authoritative T-split rules

Caution must be exercised when applying reasoning over arbitrary data collected from the Web, e.g., Linked Data. In previous works, we have encountered various problems when naïvely performing rule-based reasoning over arbitrary Linked Data [31,33]. In particular, third-party claims made about popular classes and properties must be critically analysed to ensure their trustworthiness. As a single example of the type of claims that can cause issues with reasoning, we found one document that defines nine properties as the domain of rdf:type;6 thereafter, the semantics of rdfs:domain mandate that everything which is a member of any class (i.e., almost every known resource) can be inferred to be a "member" of each of the nine properties. We found that naïve reasoning led to 200% more inferences than would be expected when considering the definitions of classes and properties as defined in their "namespace documents". Thus, in the general case, performing rule-based materialisation with respect to OWL semantics over arbitrary Linked Data requires some critical analysis of the source of data, as per our authoritative reasoning algorithm. Please see [9] for a detailed analysis of the explosion in inferences that occurs when authoritative analysis is not applied.

Such observations prompted us to investigate more robust forms of reasoning. Along these lines, when reasoning over Linked Data we apply an algorithm called authoritative reasoning [14,31,33], which critically examines the source of terminological triples and conservatively rejects those that cannot be definitively trusted, i.e., those that redefine classes and/or properties outside of the namespace document.

6 viz. http://www.eiao.net/rdf/1.0

Dereferencing and authority. We first give the function http: U → 2^𝒢 that maps a URI to an RDF graph it returns upon a HTTP lookup that returns the response code 200 Okay; in the case of failure, the function maps to an empty RDF graph. Next, we define the function redir: U → U that follows a (finite) sequence of HTTP redirects given upon lookup of a URI, or that returns the URI itself in the absence of a redirect or in the case of failure; this function first strips the fragment identifier (given after the '#' character) from the original URI, if present. Then, we give the dereferencing function deref := http ∘ redir as the mapping from a URI (a Web location) to an RDF graph it may provide by means of a given HTTP lookup that follows redirects. Finally, we can define the authority function—that maps the URI of a Web document to the set of RDF terms it is authoritative for—as follows:

auth: U → 2^C
u ↦ {c ∈ U | redir(c) = u} ∪ B(http(u))

where B(G) denotes the set of blank-nodes appearing in a graph G. In other words, a Web document is authoritative for URIs that dereference to it and the blank nodes it contains.
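The following is a minimal Java sketch of the auth function under these definitions; the redir function, the candidate URI set and the blank-node set are passed in as placeholders, and the names are illustrative rather than those of our implementation.

import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

// Sketch of auth(u): a document is authoritative for the URIs that (after
// following redirects on the fragment-stripped URI) dereference to it, plus
// the blank nodes it contains.
final class Authority {
  static Set<String> auth(String docUri,
                          Set<String> candidateUris,
                          Function<String, String> redir,   // follows HTTP redirects
                          Set<String> blankNodesInDoc) {    // B(http(u))
    Set<String> authoritativeFor = new HashSet<>(blankNodesInDoc);
    for (String c : candidateUris) {
      String stripped = c.contains("#") ? c.substring(0, c.indexOf('#')) : c;
      if (docUri.equals(redir.apply(stripped))) {
        authoritativeFor.add(c);   // c dereferences to this document
      }
    }
    return authoritativeFor;
  }
}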

We note that making RDF vocabulary URIs dereferenceable is encouraged in various best-practices [45,4]. To enforce authoritative reasoning when applying rules with terminological and assertional patterns in the body, we require that terminological triples are served by a document authoritative for terms that are bound by specific variable positions in the rule.

Authoritative T-split rule application. Given a T-split rule and a rule application involving a mapping μ as before, for the rule application to be authoritative, there must additionally exist a μ(v) such that v ∈ V(AnteT) ∩ V(AnteG), μ(v) ∈ auth(u), μ(AnteT) ⊆ http(u). We call the set of variables that appear in both the terminological and assertional segment of the rule body (i.e., V(AnteT) ∩ V(AnteG)) the set of authoritative variables for the rule. Where a rule does not have terminological or assertional patterns, we consider any rule-application as authoritative. The notation of an authoritative T-ground rule follows likewise.

Example. Let R_prp-ifp denote the same rule as used in the previous example, and let the following two triples be given by the dereferenced document of v1:isbn, but not v2:mbox—i.e., a document authoritative for the former term but not the latter:

(v1:isbn, a, owl:InverseFunctionalProperty)
(v2:mbox, a, owl:InverseFunctionalProperty)

The only authoritative variable appearing in both the terminological and assertional patterns of the body of R_prp-ifp is ?p, bound by v1:isbn and v2:mbox above. Since the document serving the two triples is authoritative for v1:isbn, the first triple is considered authoritative, and the following T-ground rule will be generated:

(?x1, v1:isbn, ?y) (?x2, v1:isbn, ?y) ⇒ (?x1, owl:sameAs, ?x2)

However, since the document is not authoritative for v2:mbox, the second T-ground rule will be considered non-authoritative and discarded:

(?x1, v2:mbox, ?y) (?x2, v2:mbox, ?y) ⇒ (?x1, owl:sameAs, ?x2)

Where third-party documents re-define remote terms in a way that affects inferencing over instance data, we filter such non-authoritative definitions.


2.6. OWL 2 RL/RDF rules

Inference rules can be used to (partially) support the semantics of RDFS and OWL. Along these lines, the OWL 2 RL/RDF [24] ruleset is a partial axiomatisation of the OWL 2 RDF-Based Semantics, which is applicable for arbitrary RDF graphs and constitutes an extension of the RDF Semantics [28]; in other words, the ruleset supports a standardised profile of reasoning that partially covers the complete OWL semantics. Interestingly for our scenario, this profile includes inference rules that support the semantics and inference of owl:sameAs, as discussed.

First, in Table 1 we provide the set of OWL 2 RL/RDF rules that support the (positive) semantics of owl:sameAs, axiomatising the symmetry (rule eq-sym) and transitivity (rule eq-trans) of the relation. To take an example, rule eq-sym is as follows:

(?x, owl:sameAs, ?y) ⇒ (?y, owl:sameAs, ?x)

where, if we know that (a, owl:sameAs, b), the rule will infer the symmetric relation (b, owl:sameAs, a). Rule eq-trans operates analogously; both together allow for computing the transitive, symmetric closure of the equivalence relation. Furthermore, Table 1 also contains the OWL 2 RL/RDF rules that support the semantics of replacement (rules eq-rep-*), whereby data that holds for one entity must also hold for equivalent entities.

Note that we (optionally, and in the case of later evaluation) choose not to support:

(i) the reflexive semantics of owl:sameAs, since reflexive owl:sameAs statements will not lead to any consolidation or other non-reflexive equality relations, and will produce a large bulk of materialisations;
(ii) equality on predicates or values for rdf:type, where we do not want possibly imprecise owl:sameAs data to affect terms in these positions.

Given that the semantics of equality is quadratic with respect to the A-Box, we apply a partial-materialisation approach that gives our notion of consolidation: instead of materialising (i.e., explicitly writing down) all possible inferences given by the semantics of replacement, we instead choose one canonical identifier to represent the set of equivalent terms, thus effectively compressing the data. We have used this approach in previous works [30–32] and it has also appeared in related works in the literature [70,6] as a common-sense optimisation for handling data-level equality. To take an example, in later evaluation (cf. Table 10) we find a valid equivalence class (set of equivalent entities) with 33,052 members; materialising all pairwise non-reflexive owl:sameAs statements would infer more than 1 billion owl:sameAs relations (33,052² − 33,052 = 1,092,401,652); further assuming that each entity appeared in, on average, e.g., two quadruples, we would infer an additional 2 billion of massively duplicated data. By choosing a single canonical identifier, we would instead only materialise 100 thousand statements.

Note that, although we only perform partial materialisation, we do not change the semantics of equality: our methods are sound with respect to OWL semantics. In addition, alongside the partially materialised data, we provide a set of consolidated owl:sameAs relations (containing all of the identifiers in each equivalence class) that can be used to "backward-chain" the full inferences possible through replacement (as required). Thus, we do not consider the canonical identifier as somehow 'definitive' or superseding the other identifiers, but merely consider it as representing the equivalence class.7

Table 1. Rules that support the positive semantics of owl:sameAs.

OWL2RL   | Antecedent (assertional)                | Consequent
eq-sym   | ?x owl:sameAs ?y                        | ?y owl:sameAs ?x
eq-trans | ?x owl:sameAs ?y . ?y owl:sameAs ?z     | ?x owl:sameAs ?z
eq-rep-s | ?s owl:sameAs ?s' . ?s ?p ?o            | ?s' ?p ?o
eq-rep-o | ?o owl:sameAs ?o' . ?s ?p ?o            | ?s ?p ?o'

Finally, herein we do not consider consolidation of literals: one may consider useful applications, e.g., for canonicalising datatype literals, but such discussion is out of the current scope.

As previously mentioned, OWL 2 RL/RDF also contains rules that use terminological knowledge (alongside assertional knowledge) to directly infer owl:sameAs relations. We enumerate these rules in Table 2; note that we italicise the labels of rules supporting features new to OWL 2, and that we list authoritative variables in bold.8

Applying only these OWL 2 RL/RDF rules may miss inference of some owl:sameAs statements. For example, consider the example RDF data given in Turtle syntax [3] as follows:

# From the FOAF Vocabulary:
foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf .
foaf:isPrimaryTopicOf owl:inverseOf foaf:primaryTopic .
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

# From Example Document A:
exA:axel foaf:homepage <http://polleres.net/> .

# From Example Document B:
<http://polleres.net/> foaf:primaryTopic exB:apolleres .

# Inferred through prp-spo1:
exA:axel foaf:isPrimaryTopicOf <http://polleres.net/> .

# Inferred through prp-inv:
exB:apolleres foaf:isPrimaryTopicOf <http://polleres.net/> .

# Subsequently inferred through prp-ifp:
exA:axel owl:sameAs exB:apolleres .
exB:apolleres owl:sameAs exA:axel .

The example uses properties from the prominent FOAF vocabulary, which is used for publishing personal profiles as RDF. Here we additionally need OWL 2 RL/RDF rules prp-inv and prp-spo1—handling standard owl:inverseOf and rdfs:subPropertyOf inferencing respectively—to infer the owl:sameAs relation entailed by the data.

Along these lines, we support an extended set of OWL 2 RL/RDF rules that contain precisely one assertional pattern, and for which we have demonstrated a scalable implementation, called SAOR, designed for Linked Data reasoning [33]. These rules are listed in Table 3 and, as per the previous example, can generate additional inferences that indirectly lead to the derivation of new owl:sameAs data. We will use these rules for entity consolidation in Section 5.

Lastly, as we discuss later (particularly in Section 7), Linked Data is inherently noisy and prone to mistakes.

7 We may optionally consider non-canonical blank-node identifiers as redundant and discard them.

8 As discussed later, our current implementation requires at least one variable to appear in all assertional patterns in the body, and so we do not support prp-key and cls-maxqc3; however, these two features are not commonly used in Linked Data vocabularies [29].

Table 2. OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs.

OWL2RL     | Antecedent (terminological)                                                          | Antecedent (assertional)    | Consequent
prp-fp     | ?p a owl:FunctionalProperty                                                          | ?x ?p ?y1, ?y2              | ?y1 owl:sameAs ?y2
prp-ifp    | ?p a owl:InverseFunctionalProperty                                                  | ?x1 ?p ?y . ?x2 ?p ?y       | ?x1 owl:sameAs ?x2
cls-maxc2  | ?x owl:maxCardinality 1 . ?x owl:onProperty ?p                                      | ?u a ?x . ?u ?p ?y1, ?y2    | ?y1 owl:sameAs ?y2
cls-maxqc4 | ?x owl:maxQualifiedCardinality 1 . ?x owl:onProperty ?p . ?x owl:onClass owl:Thing  | ?u a ?x . ?u ?p ?y1, ?y2    | ?y1 owl:sameAs ?y2

Table 3. OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]).

OWL2RL   | Antecedent (terminological)                             | Antecedent (assertional) | Consequent
prp-dom  | ?p rdfs:domain ?c                                       | ?x ?p ?y                 | ?x a ?c
prp-rng  | ?p rdfs:range ?c                                        | ?x ?p ?y                 | ?y a ?c
prp-symp | ?p a owl:SymmetricProperty                              | ?x ?p ?y                 | ?y ?p ?x
prp-spo1 | ?p1 rdfs:subPropertyOf ?p2                              | ?x ?p1 ?y                | ?x ?p2 ?y
prp-eqp1 | ?p1 owl:equivalentProperty ?p2                          | ?x ?p1 ?y                | ?x ?p2 ?y
prp-eqp2 | ?p1 owl:equivalentProperty ?p2                          | ?x ?p2 ?y                | ?x ?p1 ?y
prp-inv1 | ?p1 owl:inverseOf ?p2                                   | ?x ?p1 ?y                | ?y ?p2 ?x
prp-inv2 | ?p1 owl:inverseOf ?p2                                   | ?x ?p2 ?y                | ?y ?p1 ?x
cls-int2 | ?c owl:intersectionOf (?c1 ... ?cn)                     | ?x a ?c                  | ?x a ?c1, ..., ?cn
cls-uni  | ?c owl:unionOf (?c1 ... ?ci ... ?cn)                    | ?x a ?ci                 | ?x a ?c
cls-svf2 | ?x owl:someValuesFrom owl:Thing . ?x owl:onProperty ?p  | ?u ?p ?v                 | ?u a ?x
cls-hv1  | ?x owl:hasValue ?y . ?x owl:onProperty ?p               | ?u a ?x                  | ?u ?p ?y
cls-hv2  | ?x owl:hasValue ?y . ?x owl:onProperty ?p               | ?u ?p ?y                 | ?u a ?x
cax-sco  | ?c1 rdfs:subClassOf ?c2                                 | ?x a ?c1                 | ?x a ?c2
cax-eqc1 | ?c1 owl:equivalentClass ?c2                             | ?x a ?c1                 | ?x a ?c2
cax-eqc2 | ?c1 owl:equivalentClass ?c2                             | ?x a ?c2                 | ?x a ?c1

Table 4. OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold.

OWL2RL     | Antecedent (terminological)                                                          | Antecedent (assertional)
eq-diff1   | –                                                                                    | ?x owl:sameAs ?y . ?x owl:differentFrom ?y
prp-irp    | ?p a owl:IrreflexiveProperty                                                         | ?x ?p ?x
prp-asyp   | ?p a owl:AsymmetricProperty                                                          | ?x ?p ?y . ?y ?p ?x
prp-pdw    | ?p1 owl:propertyDisjointWith ?p2                                                     | ?x ?p1 ?y . ?x ?p2 ?y
prp-adp    | ?x a owl:AllDisjointProperties . ?x owl:members (?p1 ... ?pn)                        | ?u ?pi ?y . ?u ?pj ?y (i ≠ j)
cls-com    | ?c1 owl:complementOf ?c2                                                             | ?x a ?c1, ?c2
cls-maxc1  | ?x owl:maxCardinality 0 . ?x owl:onProperty ?p                                       | ?u a ?x . ?u ?p ?y
cls-maxqc2 | ?x owl:maxQualifiedCardinality 0 . ?x owl:onProperty ?p . ?x owl:onClass owl:Thing   | ?u a ?x . ?u ?p ?y
cax-dw     | ?c1 owl:disjointWith ?c2                                                             | ?x a ?c1, ?c2
cax-adc    | ?x a owl:AllDisjointClasses . ?x owl:members (?c1 ... ?cn)                           | ?z a ?ci, ?cj (i ≠ j)

Although our methods are sound (i.e., correct) with respect to formal OWL semantics, such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair. Relatedly, OWL contains features that allow for detecting formal contradictions—called inconsistencies—in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization: formally, the intersection of such disjoint classes should be empty and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later in Section 7 to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.
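As an illustration of how one such consistency-checking rule can be applied, the following Java sketch detects cax-dw violations (membership in two classes declared disjoint); the input maps are hypothetical simplifications used only for illustration.

import java.util.Map;
import java.util.Set;

// Sketch of the cax-dw check: flag a contradiction if some resource is typed
// with two classes that are declared owl:disjointWith each other.
final class DisjointnessCheck {
  static boolean inconsistent(Map<String, Set<String>> typesOf,          // resource -> its classes
                              Map<String, Set<String>> disjointWith) {   // c1 -> classes disjoint with c1
    for (Set<String> classes : typesOf.values()) {
      for (String c1 : classes) {
        Set<String> disjoint = disjointWith.getOrDefault(c1, Set.of());
        for (String c2 : classes) {
          if (!c1.equals(c2) && disjoint.contains(c2)) {
            return true;   // e.g., a member of both foaf:Person and foaf:Organization
          }
        }
      }
    }
    return false;
  }
}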

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations:

– Scatter: partition on-disk data using some local split function, and send each chunk to individual slave machines for subsequent processing;
– Run: request the parallel execution of a task by the slave machines—such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis;
– Gather: gathers chunks of output data from the slave swarm and performs some local merge function over the data;
– Flood: broadcast global knowledge required by all slave machines for a future task.


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine is partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all intermediary data through one machine.

Note that, herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications; that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle, waiting for the slaves, and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel: although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which it should be performed, and the task-scheduling possible. The distributed implementations of our tasks are designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid—insofar as reasonable—duplicating computation.
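To summarise the interplay of these primitives, the following Java sketch shows the master-side operations in simplified form; the Slave interface and byte-array chunk representation are hypothetical placeholders and do not reflect the actual RMI interfaces of our system.

import java.util.List;
import java.util.function.BinaryOperator;

// Sketch of the master's distributed primitives: scatter, run/gather, flood.
// (The coordinate operation runs among the slaves themselves and is omitted.)
interface Slave {
  void receiveChunk(byte[] chunk);              // target of scatter
  byte[] runTask(String taskName);              // run: execute a task on local data
  void receiveGlobal(byte[] globalKnowledge);   // target of flood
}

final class Master {
  private final List<Slave> slaves;
  Master(List<Slave> slaves) { this.slaves = slaves; }

  // Scatter: split local data and send one chunk to each slave.
  void scatter(List<byte[]> chunks) {
    for (int i = 0; i < slaves.size(); i++) slaves.get(i).receiveChunk(chunks.get(i));
  }

  // Run + gather: execute a task in parallel and merge the slaves' outputs locally.
  byte[] runAndGather(String task, BinaryOperator<byte[]> merge) {
    return slaves.parallelStream().map(s -> s.runTask(task)).reduce(new byte[0], merge);
  }

  // Flood: broadcast global knowledge (e.g., an equivalence-class index) to all slaves.
  void flood(byte[] globalKnowledge) {
    for (Slave s : slaves) s.receiveGlobal(globalKnowledge);
  }
}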

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and using the lightweight open-source Java RMIIO package9 for streaming data over the network.

9 http://openhms.sourceforge.net/rmiio/
10 We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications: viz., 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka. PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count per each PLD queue were polled first.

With regards the resulting corpus, of the 1.118 billion quads, 1.106 billion are unique and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now, we discuss the usage of terms in a data-level position: viz., terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs, and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred on average 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position—19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs—56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident—again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position).
Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;
– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;
– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;
– each URI is re-used on average in 4.7 documents, but usually only within the same domain—most external re-use is in the object position of a triple;
– 99% of URIs appear in only one PLD.

We can conclude that, within our corpus—itself a general crawl for RDF/XML on the Web—we find that there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus; since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource." [7]

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus—provided directly by the Linked Data publishers themselves—that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps identifiers to the equivalence class in which they are contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier; for each equivalence class, we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;
(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;
(iii) for each equivalence class in the index, choose a canonical term;
(iv) scan the corpus again, canonicalising any term in the subject position, or in the object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus (the owl:sameAs statements) and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union-find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;
(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);
(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing the term with the lowest alphabetical ordering; a canonical term is thus associated with each equivalence set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
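To make this canonicalisation step concrete, the following is a minimal Python sketch of canonical-term selection and the second scan, assuming an in-memory map from identifiers to their (frozen) equivalence classes; the quad representation and the "_:" test for blank-nodes are assumptions made for illustration, not part of the system itself.

def choose_canonical(eq_class):
    """Prefer URIs over blank nodes; break ties by lowest lexical order."""
    uris = [t for t in eq_class if not t.startswith("_:")]  # assume "_:" marks blank nodes
    return min(uris) if uris else min(eq_class)

def canonicalise(quads, eq_index):
    """Second scan: rewrite subjects and non-rdf:type objects to canonical form.
    eq_index maps an identifier to its equivalence class (a frozenset of terms)."""
    RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    canon = {}  # cache: identifier -> canonical term

    def c(term):
        if term not in eq_index:
            return term
        if term not in canon:
            canon[term] = choose_canonical(eq_index[term])
        return canon[term]

    for s, p, o, ctx in quads:
        o2 = o if p == RDF_TYPE else c(o)  # rdf:type objects (class terms) are left untouched
        yield c(s), p, o2, ctx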

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations.
(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index.
(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods, involving the two scans of the main corpus, can be conducted in parallel.

Algorithm 1. Building equivalence map [32]

Require: SAMEAS DATA: SA
  map ← {}
  for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
    eqc_s ← map.get(s)
    eqc_o ← map.get(o)
    if eqc_s = ∅ ∧ eqc_o = ∅ then
      eqc_{s,o} ← {s, o}
      map.put(s, eqc_{s,o})
      map.put(o, eqc_{s,o})
    else if eqc_s = ∅ then
      add s to eqc_o
      map.put(s, eqc_o)
    else if eqc_o = ∅ then
      add o to eqc_s
      map.put(o, eqc_s)
    else if eqc_s ≠ eqc_o then
      if |eqc_s| > |eqc_o| then
        add all eqc_o into eqc_s
        for e_o ∈ eqc_o do
          map.put(e_o, eqc_s)
        end for
      else
        add all eqc_s into eqc_o
        for e_s ∈ eqc_s do
          map.put(e_s, eqc_o)
        end for
      end if
    end if
  end for
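For reference, a direct in-memory rendering of Algorithm 1 might look as follows in Python (a sketch only: the input is assumed to be an iterable of (s, o) pairs taken from owl:sameAs triples, and the naive set-merging mirrors the pseudocode rather than an optimised union-find with path compression).

def build_equivalence_map(sameas_pairs):
    """Algorithm 1: map each identifier to its equivalence class (a shared set)."""
    index = {}  # identifier -> set of equivalent identifiers
    for s, o in sameas_pairs:
        if s == o:
            continue
        eqc_s, eqc_o = index.get(s), index.get(o)
        if eqc_s is None and eqc_o is None:
            eqc = {s, o}
            index[s] = index[o] = eqc
        elif eqc_s is None:
            eqc_o.add(s)
            index[s] = eqc_o
        elif eqc_o is None:
            eqc_s.add(o)
            index[o] = eqc_s
        elif eqc_s is not eqc_o:
            # merge the smaller class into the larger one, as in the pseudocode
            small, big = (eqc_s, eqc_o) if len(eqc_s) <= len(eqc_o) else (eqc_o, eqc_s)
            big.update(small)
            for e in small:
                index[e] = big
    return index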

4.3. Performance evaluation

We applied the distributed baseline consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of the time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%); i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1, 2, 4 and 8) over the 100 million quadruple corpus.

Table 5
Breakdown of timing of distributed baseline consolidation (1, 2, 4 and 8 slave machines over the 100 million quadruple corpus; full-8: eight slave machines over the full corpus). Cells give minutes and, in parentheses, the percentage of total execution time.

Category                    | 1            | 2            | 4            | 8           | Full-8
Master
  Total execution time      | 42.2 (100.0) | 22.3 (100.0) | 11.9 (100.0) | 6.7 (100.0) | 63.3 (100.0)
  Executing                 | 0.5 (1.1)    | 0.5 (2.2)    | 0.5 (4.0)    | 0.5 (7.5)   | 8.5 (13.4)
    Aggregating owl:sameAs  | 0.4 (1.1)    | 0.5 (2.2)    | 0.5 (4.0)    | 0.5 (7.5)   | 8.4 (13.3)
    Miscellaneous           | 0.1 (0.1)    | 0.1 (0.4)    | ∼ (∼)        | ∼ (∼)       | 0.1 (0.1)
  Idle (waiting for slaves) | 41.8 (98.9)  | 21.8 (97.7)  | 11.4 (95.8)  | 6.2 (92.1)  | 54.8 (86.6)
Slaves
  Avg. executing (total)    | 41.8 (98.9)  | 21.8 (97.7)  | 11.4 (95.8)  | 6.2 (92.1)  | 53.5 (84.6)
    Extract owl:sameAs      | 9.0 (21.4)   | 4.5 (20.2)   | 2.2 (18.4)   | 1.1 (16.8)  | 12.3 (19.5)
    Consolidate             | 32.7 (77.5)  | 17.1 (76.9)  | 8.8 (73.5)   | 4.7 (70.5)  | 41.2 (65.1)
  Avg. idle                 | 0.5 (1.1)    | 0.7 (2.9)    | 1.0 (8.2)    | 0.8 (12.7)  | 9.8 (15.4)
    Waiting for peers       | 0.0 (0.0)    | 0.1 (0.7)    | 0.5 (4.0)    | 0.3 (4.7)   | 1.3 (2.0)
    Waiting for master      | 0.5 (1.1)    | 0.5 (2.3)    | 0.5 (4.2)    | 0.5 (7.9)   | 8.5 (13.4)

Fig. 3. Distribution of sizes of equivalence classes on log/log scale (from [32]); y-axis: number of classes, x-axis: equivalence class size.

Fig. 4. Distribution of the number of PLDs per equivalence class on log/log scale; y-axis: number of classes, x-axis: number of PLDs in equivalence class.

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-2877d02973d0d01e8f29c7113776e7e which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


Note that in the table, we use the symbol ∼ to indicate a negligible value (0 < ∼ < 0.05). We see that as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5× would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs): an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not they were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp., Dieter Fensel and Rudi Studer: such authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites generally referring to the United States, mostly from DBpedia and Last.fm. With respect to the DBpedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); or (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors (perhaps due to the snowballing effect of the transitive and symmetric closure), we also manually evaluated a random sample of results.


Table 6
Largest five equivalence classes (from [32]).

# | Canonical term (lexically first in equivalence class)                            | Size | Correct?
1 | http://bio2rdf.org/dailymed_drugs:1000                                           | 8481 | ✗
2 | http://bio2rdf.org/dailymed_drugs:1042                                           | 800  | ✗
3 | http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e      | 443  | ✓
4 | http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1            | 353  | ✓
5 | http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9 | 316  | ✓


For the sampling, we take the closed equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR. In terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.
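The pair-sampling procedure just described can be sketched as follows (a simplified, in-memory illustration; the actual evaluation samples from the on-disk closure, and every equivalence class has at least two members by construction):

import random

def sample_pairs(equivalence_classes, n=1000, seed=42):
    """Randomly sample coreferent identifier pairs for manual inspection."""
    rng = random.Random(seed)
    # flatten so that larger classes are proportionally more likely to be hit
    identifiers = [(term, eqc) for eqc in equivalence_classes for term in eqc]
    pairs = []
    while len(pairs) < n:
        term, eqc = rng.choice(identifiers)
        other = rng.choice([t for t in eqc if t != term])  # classes have >= 2 members
        pairs.append((term, other))
    return pairs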

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD-pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

#  | PLD           | PLD                     | Co-occur
1  | rdfize.com    | uriburner.com           | 506,623
2  | dbpedia.org   | freebase.com            | 187,054
3  | bio2rdf.org   | purl.org                | 185,392
4  | loc.gov       | info:lc/authorities (a) | 166,650
5  | l3s.de        | rkbexplorer.com         | 125,842
6  | bibsonomy.org | l3s.de                  | 125,832
7  | bibsonomy.org | rkbexplorer.com         | 125,830
8  | dbpedia.org   | mpii.de                 | 99,827
9  | freebase.com  | mpii.de                 | 89,610
10 | dbpedia.org   | umbel.org               | 65,565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities, in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation, alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features, including functional properties, inverse-functional properties and certain cardinality restrictions, that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes, such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc., to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, Wikipedia articles, etc.

Although the Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to the methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive, symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in-memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise, on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked Data corpora.

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that, since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations; the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data; hence, we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply the general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;
(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed, batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

In Step (i), we extract terminological data (required for application of our rules) from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

  homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate the T-ground rule

  x homepage y ⇒ x isPrimaryTopicOf y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

  x isPrimaryTopicOf y

which may indicate data useful for consolidation (note that here, isPrimaryTopicOf is assumed to be inverse-functional).

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern, including any owl:sameAs statements found, are buffered to a file for later join computation.
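A minimal sketch of this single-scan step is given below; the rule and pattern representations are simplifications assumed for illustration (each T-ground rule is reduced to a predicate rewrite, as in the sub-property example above), and the function and file names are hypothetical.

def apply_tground_rules(triple, rules):
    """Apply T-ground rules with one assertional pattern: if the predicate matches a
    rule body, emit a triple with the predicate rewritten (e.g. homepage => isPrimaryTopicOf)."""
    s, p, o = triple
    return [(s, head_p, o) for body_p, head_p in rules if p == body_p]

def single_scan(triples, rules, consolidation_predicates, buffer_file):
    """Single scan over the corpus: apply T-ground general rules and buffer any
    input/inferred statement whose predicate is consolidation-relevant
    (e.g. owl:sameAs, or an inverse-functional/functional property)."""
    with open(buffer_file, "w") as out:
        for triple in triples:
            for t in [triple] + apply_tground_rules(triple, rules):
                if t[1] in consolidation_predicates:
                    out.write(" ".join(t) + "\n")
            # (inferred triples would also be emitted to the main reasoning output here)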

Subsequently, in Step (iv), we must compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19

18 These files are GZip-compressed flat files of N-Triple-like syntax, encoding arbitrary length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.


Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.
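As an illustration of the inverse-functional-property case, the following sketch performs the merge-join over statements pre-sorted by (object, predicate), with itertools.groupby standing in for the aligned linear scan over the sorted on-disk file; the tuple layout is an assumption made for the example.

from itertools import groupby

def ifp_sameas(sorted_ifp_statements):
    """Derive owl:sameAs from inverse-functional properties.
    Input: (o, p, s) tuples whose predicate p is inverse-functional, sorted by (o, p);
    all subjects sharing a (p, o) pair are equated (here linked to the lexically lowest)."""
    for (o, p), group in groupby(sorted_ifp_statements, key=lambda t: (t[0], t[1])):
        subjects = sorted({s for _, _, s in group})
        canonical = subjects[0]
        for s in subjects[1:]:
            yield (s, "owl:sameAs", canonical)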

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method again, leveraging external sorts and merge-joins to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b, b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;
(ii) transitivity is supported by iterative merge-join scans:
  – in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
  – at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;
  – the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
  – inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;
(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;
(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

20 One could instead build an on-disk map for equivalence classes and pivot elements, and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences and, when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external-sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20
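A possible sketch of this rewrite step is given below, with in-memory iterators standing in for the sorted on-disk files: the corpus tuples are assumed to be sorted on the position being rewritten, and the pruned owl:sameAs index is assumed to hold (non-canonical, canonical) pairs sorted by the first element. Rewriting subjects and then objects amounts to two calls to this procedure, with an external sort in between.

def rewrite_position(sorted_tuples, sorted_sameas, pos):
    """Merge-join a corpus sorted on position `pos` against the pruned owl:sameAs
    index (pairs (non_canonical, canonical), sorted by non_canonical), rewriting
    that position to the canonical identifier."""
    sa = iter(sorted_sameas)
    current = next(sa, None)
    for t in sorted_tuples:
        key = t[pos]
        # advance the owl:sameAs iterator until it catches up with the corpus key
        while current is not None and current[0] < key:
            current = next(sa, None)
        if current is not None and current[0] == key:
            t = t[:pos] + (current[1],) + t[pos + 1:]
        yield t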

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

  dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
  axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
  <http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds (since foaf:isPrimaryTopicOf is inverse-functional). We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs data, we require a higher-level iterative process, as follows:

(i) initially apply the consolidation rules and append the results to a file alongside the asserted owl:sameAs statements found;
(ii) apply the initial closure of the owl:sameAs data;
(iii) then, iteratively until no new owl:sameAs inferences are found:
  – rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
  – derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;
  – re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge.
(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules.
(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines.
(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data.
(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:


Table 8
Breakdown of timing of distributed extended consolidation w/ reasoning, where * identifies the master/slave tasks that run concurrently. Cells give minutes and, in parentheses, the percentage of total execution time.

Category                                | 1             | 2             | 4             | 8            | Full-8
Master
  Total execution time                  | 454.3 (100.0) | 230.2 (100.0) | 127.5 (100.0) | 81.3 (100.0) | 740.4 (100.0)
  Executing                             | 29.9 (6.6)    | 30.3 (13.1)   | 32.5 (25.5)   | 30.1 (37.0)  | 243.6 (32.9)
    Aggregate consolidation-relevant data | 4.3 (0.9)   | 4.5 (2.0)     | 4.7 (3.7)     | 4.6 (5.7)    | 29.9 (4.0)
    Closing owl:sameAs *                | 24.4 (5.4)    | 24.5 (10.6)   | 25.4 (19.9)   | 24.4 (30.0)  | 211.2 (28.5)
    Miscellaneous                       | 1.2 (0.3)     | 1.3 (0.5)     | 2.4 (1.9)     | 1.1 (1.3)    | 2.5 (0.3)
  Idle (waiting for slaves)             | 424.5 (93.4)  | 199.9 (86.9)  | 95.0 (74.5)   | 51.2 (63.0)  | 496.8 (67.1)
Slaves
  Avg. executing (total)                | 448.9 (98.8)  | 221.2 (96.1)  | 111.6 (87.5)  | 56.5 (69.4)  | 599.1 (80.9)
    Extract terminology                 | 44.4 (9.8)    | 22.7 (9.9)    | 11.8 (9.2)    | 6.9 (8.4)    | 49.4 (6.7)
    Extract consolidation-relevant data | 95.5 (21.0)   | 48.3 (21.0)   | 24.3 (19.1)   | 13.2 (16.2)  | 137.9 (18.6)
    Initial sort (by subject) *         | 98.0 (21.6)   | 48.3 (21.0)   | 23.7 (18.6)   | 11.6 (14.2)  | 133.8 (18.1)
    Consolidation                       | 211.0 (46.4)  | 101.9 (44.3)  | 51.9 (40.7)   | 24.9 (30.6)  | 278.0 (37.5)
  Avg. idle                             | 5.5 (1.2)     | 8.9 (3.9)     | 37.9 (29.7)   | 23.6 (29.0)  | 141.3 (19.1)
    Waiting for peers                   | 0.0 (0.0)     | 3.2 (1.4)     | 7.1 (5.6)     | 6.3 (7.7)    | 37.5 (5.1)
    Waiting for master                  | 5.5 (1.2)     | 5.8 (2.5)     | 30.8 (24.1)   | 17.3 (21.2)  | 103.8 (14.0)


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine.
– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c).

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed.

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible: the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time; of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment, we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards to performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507×, 0.554× and 0.638×, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by (i) the extraction of consolidation-relevant statements (this time directly extracted from explicit statements, as opposed to explicit and inferred statements), which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); and (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine, and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9
Top ten most frequently occurring blacklisted values.

#  | Blacklisted term                            | Occurrences
1  | Empty literals                              | 584,735
2  | <http://null>                               | 414,088
3  | <http://www.vox.com/gone>                   | 150,402
4  | "08445a31a78661b5c746feff39a9db6e4e2cc5cf"  | 58,338
5  | <http://www.facebook.com>                   | 6,988
6  | <http://facebook.com>                       | 5,462
7  | <http://www.google.com>                     | 2,234
8  | <http://www.facebook.com/>                  | 1,254
9  | <http://google.com>                         | 1,108
10 | <http://null.com>                           | 542

Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); y-axis: number of classes, x-axis: equivalence class size.


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties, and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus, we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments for older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2, and s2 is a canonical identifier) contained 12.03 million statements.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 2173× from baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user, labelled Team Vox, commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford, labelled IBRG, where again it is identified in thousands of documents using different blank-nodes but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME 823 times (82.3%), TRIVIALLY SAME 145 times (14.5%), DIFFERENT 23 times (2.3%), and UNCLEAR 9 times (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due, in part, to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Table 10
Largest five equivalence classes after extended consolidation.

# | Canonical term (lexically lowest in equivalence class)      | Size  | Correct?
1 | bnode37@http://a12iggymom.vox.com/profile/foaf.rdf          | 33052 | ✓
2 | http://bio2rdf.org/dailymed_drugs:1000                      | 8481  | ✗
3 | http://ajft.org/rdf/foaf.rdf#me                              | 8140  | ✓
4 | bnode4@http://174.129.12.140:8080/tcm/data/association100   | 4395  | ✓
5 | bnode1@http://aaa977.vox.com/profile/foaf.rdf                | 1977  | ✓

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); y-axis: number of classes, x-axis: number of PLDs in equivalence class.

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

# | Canonical identifier                       | BL | R
1 | <http://dblp.l3s.de/Tim_Berners-Lee>       | 26 | 50
2 | <genid:danbri>                             | 10 | 138
3 | <http://update.status.net>                 | 0  | 0
4 | <http://www.ldodds.com/foaf/foaf-a-matic>  | 0  | 0
5 | <http://update.status.net/user/1#acct>     | 0  | 6

Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated, and reasoning consolidated data (log/log); y-axis: number of terms, x-axis: number of PLDs mentioning blank-node/URI term.


Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which, of course, subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

25 We make these and later results available for review at http://swse.deri.org/entity.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that (i) the vast majority of identifiers (about 99%) still only appear in one PLD, and (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

26 Domenico Gendarmi, with three URIs: one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; and (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score: for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities, with higher concurrence scores, are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;
– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or coreference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values, instead of URIs, to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation; in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic, and domain agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case in SWSE: we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);
(ii) the concurrence measure should give a higher weight to exclusive shared-pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;
(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities: a shared pair cannot reduce the measured concurrence of the sharing entities;
(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;
(v) correlation may exist between shared pairs; e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers), where we wish to dampen the cumulative effect of correlation in the concurrence analysis;
(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


  dblp:AliceB10 foaf:maker ex:Alice .
  dblp:AliceB10 foaf:maker ex:Bob .
  ex:Alice foaf:gender "female" .
  ex:Alice foaf:workplaceHomepage <http://wonderland.com> .
  ex:Bob foaf:gender "male" .
  ex:Bob foaf:workplaceHomepage <http://wonderland.com> .
  ex:Claire foaf:gender "female" .
  ex:Claire foaf:workplaceHomepage <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between three colleagues ex:Alice, ex:Bob and ex:Claire, i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence
First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality, as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p wrt. s in G, denoted Card_G(p, s), is the cardinality of the set {o ∈ C | (s, p, o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p wrt. o in G, denoted ICard_G(p, o), is the cardinality of the set {s ∈ U ∪ B | (s, p, o) ∈ G}.

Thus, loosely, the observed cardinality of a property-subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_{G_ex}(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property-subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property-object pair is the number of unique subjects it appears with in the graph; e.g., ICard_{G_ex}(foaf:gender, "female") = 2. Both directions are considered analogously for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication we henceforth focus ondescribing only the inverse-cardinality statistics of a propertywhere the analogous metrics for plain-cardinality can be derivedby switching subject and object (that is switching directional-ity)mdashwe choose the inverse direction as perhaps being more intu-itive indicating concurrence of entities in the subject positionbased on the predicatendashobject pairs they share

Definition 3 (Average inverse-cardinality) Let G be an RDF graphand p be a property used as a predicate in G The average inverse-cardinality (AIC) of p with respect to G written AICG(p) is theaverage of the non-zero inverse-cardinalities of p in the graph GFormally

AICGethpTHORN frac14j feths oTHORN j eths p oTHORN 2 Gg jj fo j 9s ethsp oTHORN 2 Gg j

The average cardinality ACG(p) of a property p is defined analo-

gously as for AICG(p) Note that the (inverse-)cardinality value ofany term appearing as a predicate in the graph is necessarily great-er-than or equal-to one the numerator is by definition greater-than or equal-to the denominator Taking an example AICGex (foafgender)=15 which can be viewed equivalently as the averagenon-zero cardinalities of foafgender (1 for male and 2 forfemale) or the number of triples with predicate foaf genderdivided by the number of unique values appearing in the object po-sition of such triples (3

2)We call a property p for which we observe AICG(p) 1 a quasi-

inverse-functional property with respect to the graph G and analo-gously properties for which we observe ACG(p) 1 as quasi-func-tional properties We see the values of such propertiesmdashin theirrespective directionsmdashas being very exceptional very rarelyshared by entities Thus we would expect a property such as foafgender to have a high AICG(p) since there are only two object-val-ues (male female) shared by a large number of entitieswhereas we would expect a property such as foafworkplace-

Homepage to have a lower AICG(p) since there are arbitrarily manyvalues to be shared amongst the entities given such an observa-tion we then surmise that a shared foafgender value representsa weaker lsquolsquoindicatorrsquorsquo of concurrence than a shared value forfoafworkplaceHomepage

Given that we deal with incomplete information under the OpenWorld Assumption underlying RDF(S)OWL we also wish to weighthe average (inverse) cardinality values for properties with a lownumber of observations towards a global meanmdashconsider a fictional property exmaritalStatus for which we onlyencounter a few predicate-usages in a given graph and consider twoentities given the value married given sparse inverse-cardinal-ity observations we may naiumlvely over-estimate the significance ofthis property-object pair as an indicator for concurrence Thus weuse a credibility formula as follows to weight properties with fewobservations towards a global mean as follows

Definition 4 (Adjusted average inverse-cardinality) Let p be aproperty appearing as a predicate in the graph G The adjustedaverage inverse-cardinality (AAIC) of p with respect to G writtenAAICG(p) is then

AAICGethpTHORN frac14AICGethpTHORN j G p j thornAICG j G j

j G p j thorn j G jeth1THORN

where jG pj is the number of distinct objects that appear in a triplewith p as a predicate (the denominator of Definition 3) AICG is theaverage inverse-cardinality for all predicatendashobject pairs (formallyAICG frac14 jGj

jfethpoTHORNj9sethspoTHORN2Ggj) and jGj is the average number of distinct

objects for all predicates in the graph (formally j G j frac14jfethpoTHORNj9sethspoTHORN2Ggjjfpj9s9oethspoTHORN2GgjTHORN

Again the adjusted average cardinality AACG(p) of a property pis defined analogously as for AAICG(p) Some reduction of Eq (1) is

possible if one considers that AICGethpTHORN jG pj frac14 jfeths oTHORNjeths p oTHORN2 Ggj also denotes the number of triples for which p appears as a

predicate in graph G and that AICG j Gj frac14 jGjjfpj9s9oethspoTHORN2Ggj also de-

notes the average number of triples per predicate We maintain Eq

swpersonstefan-decker

swpersonandreas-harth

swpaper140

swpaper2221

swpaper3403

dbpediaIreland

sworgnui-galway

sworgderi-nui-galway

foafmade

swrcaffiliaton

foafbased_near

dccreator swrcauthorfoafmaker

foafmade

dccreatorswrcauthorfoafmaker

foafmadedccreator

swrcauthorfoafmaker

foafmember

foafmember

swrcaffiliaton

Fig 8 Example of same-value inter-property and intra-property correlation where the two entities under comparison are highlighted in the dashed box and where thelabels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34])

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 95

(1) in the given unreduced form as it more clearly corresponds tothe structure of a standard credibility formula the reading(AICG(p)) is dampened towards a mean (AICG) by a factor deter-mined by the size of the sample used to derive the reading

(jG pj) relative to the average sample size (jGj)Now we move towards combining these metrics to determine

the concurrence of entities who share a given non-empty set ofpropertyndashvalue pairs To do so we combine the adjusted average(inverse) cardinality values which apply generically to propertiesand the (inverse) cardinality values which apply to a given prop-ertyndashvalue pair For example take the property foafworkplace-Homepage entities that share a value referential to a largecompanymdasheg httpgooglecommdashshould not gain as muchconcurrence as entities that share a value referential to a smallercompanymdasheg httpderiie Conversely consider a fictionalproperty excitizenOf which relates a citizen to its country andfor which we find many observations in our corpus returning ahigh AAIC value and consider that only two entities share the va-lue exVanuatu for this property given that our data are incom-plete we can use the high AAIC value of excitizenOf todetermine that the property is usually not exclusive and that itis generally not a good indicator of concurrence28

We start by assigning a coefficient to each pair (po) and eachpair (ps) that occur in the dataset where the coefficient is an indi-cator of how exclusive that pair is

Definition 5 (Concurrence coefficients) The concurrence-coefficientof a predicatendashsubject pair (ps) with respect to a graph G is given as

CGethp sTHORN frac141

CardGethp sTHORN AACGethpTHORN

and the concurrence-coefficient of a predicatendashobject pair (po) withrespect to a graph G is analogously given as

ICGethp oTHORN frac141

ICardGethp oTHORN AAICGethpTHORN

Again please note that these coefficients fall into the interval]01] since the denominator by definition is necessarily greaterthan one

To take an example let pwh = foafworkplaceHomepage andsay that we compute AAICG(pwh)=7 from a large number of obser-

28 Here we try to distinguish between propertyndashvalue pairs that are exclusive inreality (ie on the level of whatrsquos signified) and those that are exclusive in the givengraph Admittedly one could think of counter-examples where not including thegeneral statistics of the property may yield a better indication of weightedconcurrence particularly for generic properties that can be applied in many contextsfor example consider the exclusive predicatendashobject pair (skos subject cate-gory Koenigsegg_vehicles) given for a non-exclusive property

vations indicating that each workplace homepage in the graph G islinked to by on average seven employees Further letog = httpgooglecom and assume that og occurs 2000 timesas a value for pwh ICardG(pwhog) = 2000 now ICGethpwh ogTHORN frac14

120007 frac14 000007 Also let od = httpderiie such thatICardG(pwhod)=100 now ICGethpwh odTHORN frac14 1

107 000143 Heresharing DERI as a workplace will indicate a higher level of concur-rence than analogously sharing Google

Finally we require some means of aggregating the coefficientsof the set of pairs that two entities share to derive the final concur-rence measure

Definition 6 (Aggregated concurrence score) Let Z = (z1 zn) be atuple such that for each i = 1 n zi2]01] The aggregatedconcurrence value ACSn is computed iteratively starting withACS0 = 0 then for each k = 1 n ACSk = zk + ACSk1 zkfraslACSk1

The computation of the ACS value is the same process as deter-mining the probability of two independent events occurringmdashP(A _ B) = P(A) + P(B) P(AfraslB)mdashwhich is by definition commuta-tive and associative and thus computation is independent of theorder of the elements in the tuple It may be more accurate to viewthe coefficients as fuzzy values and the aggregation function as adisjunctive combination in some extensions of fuzzy logic [75]

However the underlying coefficients may not be derived fromstrictly independent phenomena there may indeed be correlationbetween the propertyndashvalue pairs that two entities share To illus-trate we reintroduce a relevant example from [34] shown in Fig 8where we see two researchers that have co-authored many paperstogether have the same affiliation and are based in the samecountry

This example illustrates three categories of concurrencecorrelation

(i) same-value correlation where two entities may be linked to thesame value by multiple predicates in either direction (egfoaf made dccreator swrcauthor foafmaker)

(ii) intra-property correlation where two entities that share agiven propertyndashvalue pair are likely to share further valuesfor the same property (eg co-authors sharing one valuefor foafmade are more likely to share further values)

(iii) inter-property correlation where two entities sharing a givenpropertyndashvalue pair are likely to share further distinct butrelated propertyndashvalue pairs (eg having the same valuefor swrcaffiliation and foafbased_near)

Ideally we would like to reflect such correlation in the compu-tation of the concurrence between the two entities

Regarding same-value correlation for a value with multipleedges shared between two entities we choose the shared predicate

96 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

edge with the lowest AA[I]C value and disregard the other edgesie we only consider the most exclusive property used by bothentities to link to the given value and prune the other edges

Regarding intra-property correlation we apply a lower-levelaggregation for each predicate in the set of shared predicate-valuepairs Instead of aggregating a single tuple of coefficients we gen-erate a bag of tuples Z frac14 fZp1 Zpn

g where each element Zpi

represents the tuple of (non-pruned) coefficients generated forthe predicate pi29 We then aggregate this bag as follows

ACSethZTHORN frac14 ACSethACSethZpi AAfrac12ICethpiTHORNTHORNZpi

2ZTHORN

where AA[I]C is either AAC or AAIC dependant on the directionalityof the predicate-value pair observed Thus the total contributionpossible through a given predicate (eg foafmade) has anupper-bound set as its AA[I]C value where each successive sharedvalue for that predicate (eg each successive co-authored paper)contributes positively (but increasingly less) to the overall concur-rence measure

Detecting and counteracting inter-property correlation is per-haps more difficult and we leave this as an open question

612 Implementing entity-concurrence analysisWe aim to implement the above methods using sorts and scans

and we wish to avoid any form of complex indexing or pair-wisecomparison First we wish to extract the statistics relating to the(inverse-)cardinalities of the predicates in the data Given that thereare 23 thousand unique predicates found in the input corpus weassume that we can fit the list of predicates and their associatedstatistics in memorymdashif such were not the case one could consideran on-disk map where we would expect a high cache hit-rate basedon the distribution of property occurrences in the data (cf [32])

Moving forward we can calculate the necessary predicate-levelstatistics by first sorting the data according to natural order(spoc) and then scanning the data computing the cardinality(number of distinct objects) for each (sp) pair and maintainingthe average cardinality for each p found For inverse-cardinalityscores we apply the same process sorting instead by (opsc) or-der counting the number of distinct subjects for each (po) pairand maintaining the average inverse-cardinality scores for eachp After each scan the statistics of the properties are adjustedaccording to the credibility formula in Eq (4)

We then apply a second scan of the sorted corpus first we scanthe data sorted in natural order and for each (sp) pair for each setof unique objects Ops found thereon and for each pair in

fethoi ojTHORN 2 U [ B U [ B j oi oj 2 Ops oi lt ojg

where lt denotes lexicographical order we output the following sex-tuple to an on-disk file

ethoi ojCethp sTHORNp sTHORN

where Cethp sTHORN frac14 1jOps jAACethpTHORN We apply the same process for the other

direction for each (po) pair for each set of unique subjects Spo andfor each pair in

fethsi sjTHORN 2 U [ B U [ B j si sj 2 Spo si lt sjg

we output analogous sextuples of the form

ethsi sj ICethp oTHORN p othornTHORN

We call the sets Ops and their analogues Spo concurrence classesdenoting sets of entities that share the given predicatendashsubjectpredicatendashobject pair respectively Here note that the lsquo + rsquo andlsquo rsquo elements simply mark and track the directionality from whichthe tuple was generated required for the final aggregation of the

29 For brevity we omit the graph subscript

co-efficient scores Similarly we do not immediately materialisethe symmetric concurrence scores where we instead do so at theend so as to forego duplication of intermediary processing

Once generated we can sort the two files of tuples by their nat-ural order and perform a merge-join on the first two elementsmdashgeneralising the directional oisi to simply ei each (eiej) pair de-notes two entities that share some predicate-value pairs in com-mon where we can scan the sorted data and aggregate the finalconcurrence measure for each (eiej) pair using the information en-coded in the respective tuples We can thus generate (triviallysorted) tuples of the form (eiejs) where s denotes the final aggre-gated concurrence score computed for the two entities optionallywe can also write the symmetric concurrence tuples (ejeis) whichcan be sorted separately as required

Note that the number of tuples generated is quadratic with re-spect to the size of the respective concurrence class which be-comes a major impediment for scalability given the presence oflarge such setsmdashfor example consider a corpus containing 1 mil-lion persons sharing the value female for the property foaf

gender where we would have to generate 1062106

2 500 billionnon-reflexive non-symmetric concurrence tuples However wecan leverage the fact that such sets can only invoke a minor influ-ence on the final concurrence of their elements given that themagnitude of the setmdasheg jSpojmdashis a factor in the denominator ofthe computed C(po) score such that Cethp oTHORN 1

jSop j Thus in practice

we implement a maximum-size threshold for the Spo and Ops con-currence classes this threshold is selected based on a practicalupper limit for raw similarity tuples to be generated where theappropriate maximum class size can trivially be determined along-side the derivation of the predicate statistics For the purpose ofevaluation we choose to keep the number of raw tuples generatedat around 1 billion and so set the maximum concurrence classsize at 38mdashwe will see more in Section 64

62 Distributed implementation

Given the previous discussion our distributed implementationis fairly straight-forward as follows

(i) Coordinate the slave machines split their segment of thecorpus according to a modulo-hash function on the subjectposition of the data sort the segments and send the splitsegments to the peer determined by the hash-function theslaves simultaneously gather incoming sorted segmentsand subsequently perform a merge-sort of the segments

(ii) Coordinate the slave machines apply the same operationthis time hashing on objectmdashtriples with rdf type as pred-icate are not included in the object-hashing subsequentlythe slaves merge-sort the segments ordered by object

(iii) Run the slave machines then extract predicate-level statis-tics and statistics relating to the concurrence-class-size dis-tribution which are used to decide upon the class sizethreshold

(iv) Gatherfloodrun the master machine gathers and aggre-gates the high-level statistics generated by the slavemachines in the previous step and sends a copy of the globalstatistics back to each machine the slaves subsequentlygenerate the raw concurrence-encoding sextuples asdescribed before from a scan of the data in both orders

(v) Coordinate the slave machines coordinate the locally gen-erated sextuples according to the first element (join posi-tion) as before

(vi) Run the slave machines aggregate the sextuples coordi-nated in the previous step and produce the final non-sym-metric concurrence tuples

Table 12Breakdown of timing of distributed concurrence analysis

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 5162 1000 2627 1000 1378 1000 730 1000 8354 1000Executing 22 04 23 09 31 23 34 46 21 03

Miscellaneous 22 04 23 09 31 23 34 46 21 03Idle (waiting for slaves) 5140 996 2604 991 1346 977 696 954 8333 997

SlaveAvg executing (total exc idle) 5140 996 2578 981 1311 951 664 910 7866 942

Splitsortscatter (subject) 1196 232 593 226 299 217 149 204 1759 211Merge-sort (subject) 421 82 207 79 112 81 57 79 624 75

Splitsortscatter (object) 1154 224 565 215 288 209 143 196 1640 196Merge-sort (object) 372 72 187 71 96 70 51 70 556 66Extract high-level statistics 208 40 101 38 51 37 27 37 284 33Generate raw concurrence tuples 454 88 233 89 116 84 59 80 677 81Cooordinatesort concurrence tuples 976 189 501 191 250 181 125 172 1660 199Merge-sortaggregate similarity 360 70 192 73 100 72 53 72 666 80

Avg idle 22 04 49 19 67 49 65 90 488 58Waiting for peers 00 00 26 10 36 26 32 43 467 56Waiting for master 22 04 23 09 31 23 34 46 21 03

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 97

(vii) Run the slave machines produce the symmetric version ofthe concurrence tuples and coordinate and sort on the firstelement

Here we make heavy use of the coordinate function to aligndata according to the join position required for the subsequentprocessing stepmdashin particular aligning the raw data by subjectand object and then the concurrence tuples analogously

Note that we do not hash on the object position of rdf typetriples our raw corpus contains 2068 million such triples and gi-ven the distribution of class memberships we assume that hashingthese values will lead to uneven distribution of data and subse-quently uneven load balancingmdasheg 792 of all class member-ships are for foafPerson hashing on which would send 1637million triples to one machine which alone is greater than theaverage number of triples we would expect per machine (1398million) In any case given that our corpus contains 105 thousandunique values for rdf type we would expect the average-in-verse-cardinality to be approximately 1970mdasheven for classes withtwo members the potential effect on concurrence is negligible

63 Performance evaluation

We apply our concurrence analysis over the consolidated cor-pus derived in Section 5 The total time taken was 139 h Sortingsplitting and scattering the data according to subject on the slavemachines took 306 h with an average idle time of 77 min(42) Subsequently merge-sorting the sorted segments took113 h with an average idle time of 54 min (8) Analogously sort-ing splitting and scattering the non-rdftype statements by ob-ject took 293 h with an average idle time of 118 min (67)Merge sorting the data by object took 099 h with an average idletime of 38 min (63) Extracting the predicate statistics andthreshold information from data sorted in both orders took29 min with an average idle time of 06 min (21) Generatingthe raw unsorted similarity tuples took 698 min with an averageidle time of 21 min (3) Sorting and coordinating the raw similar-ity tuples across the machines took 1801 min with an average idletime of 141 min (78) Aggregating the final similarity took678 min with an average idle time of 12 min (18)

Table 12 presents a breakdown of the timing of the task Firstwith regards application over the full corpus although this task re-quires some aggregation of global-knowledge by the master

machine the volume of data involved is minimal a total of 21minutes is spent on the master machine performing various minortasks (initialisation remote calls logging aggregation and broad-cast of statistics) Thus 997 of the task is performed in parallelon the slave machine Although there is less time spent waitingfor the master machine compared to the previous two tasks deriv-ing the concurrence measures involves three expensive sortcoor-dinatemerge-sort operations to redistribute and sort the data overthe slave swarm The slave machines were idle for on average 58of the total task time most of this idle time (996) was spentwaiting for peers The master machine was idle for almost the en-tire task with 997 waiting for the slave machines to finish theirtasksmdashagain interleaving a job for another task would have practi-cal benefits

Second with regards the timing of tasks when varying the num-ber of slave machines for 100 million quadruples we see that thepercentage of time spent idle on the slave machines increases from04 (1) to 9 (8) However for each incremental doubling of thenumber of slave machines the total overall task times are reducedby 0509 0525 and 0517 respectively

64 Results evaluation

With respect to data distribution after hashing on subject weobserved an average absolute deviation (average distance fromthe mean) of 176 thousand triples across the slave machines rep-resenting an average 013 deviation from the mean near-optimaldata distribution After hashing on the object of non-rdftype tri-ples we observed an average absolute deviation of 129 million tri-ples across the machines representing an average 11 deviationfrom the mean in particular we note that one machine was as-signed 37 million triples above the mean (an additional 33 abovethe mean) Although not optimal the percentage of data deviationgiven by hashing on object is still within the natural variation inrun-times we have seen for the slave machines during most paral-lel tasks

In Fig 9a and b we illustrate the effect of including increasinglylarge concurrence classes on the number of raw concurrence tuplesgenerated For the predicatendashobject pairs we observe a powerndashlaw(-esque) relationship between the size of the concurrence classand the number of such classes observed Second we observe thatthe number of concurrences generated for each increasing classsize initially remains fairly staticmdashie larger class sizes give

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(a) Predicate-object pairs

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(b) Predicate-subject pairs

Fig 9 Breakdown of potential concurrence classes given with respect to predicatendashobject pairs and predicatendashsubject pairs respectively where for each class size on the x-axis (si) we show the number of classes (ci) the raw non-reflexive non-symmetric concurrence tuples generated (spi frac14 ci

s2ithornsi

2 ) a cumulative count of tuples generated forincreasing class sizes (cspi frac14

Pj6ispj) and our cut-off (si = 38) that we have chosen to keep the total number of tuples at 1 billion (loglog)

Table 13Top five predicates with respect to lowest adjusted average cardinality (AAC)

Predicate Objects Triples AAC

1 foafnick 150433035 150437864 10002 lldpubmedjournal 6790426 6795285 10033 rdfvalue 2802998 2803021 10054 eurostatgeo 2642034 2642034 10055 eurostattime 2641532 2641532 1005

Table 14Top five predicates with respect to lowest adjusted average inverse-cardinality(AAIC)

Predicate Subjects Triples AAIC

1 lldpubmedmeshHeading 2121384 2121384 10092 opiumfieldrecommendation 1920992 1920992 10103 fbtypeobjectkey 1108187 1108187 10174 foafpage 1702704 1712970 10175 skipinionshasFeature 1010604 1010604 1019

98 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

quadratically more concurrences but occur polynomially less of-tenmdashuntil the point where the largest classes that generally onlyhave one occurrence is reached and the number of concurrencesbegins to increase quadratically Also shown is the cumulativecount of concurrence tuples generated for increasing class sizeswhere we initially see a powerndashlaw(-esque) correlation whichsubsequently begins to flatten as the larger concurrence classes be-come more sparse (although more massive)

For the predicatendashsubject pairs the same roughly holds truealthough we see fewer of the very largest concurrence classesthe largest concurrence class given by a predicatendashsubject pairwas 79 thousand versus 19 million for the largest predicatendashob-ject pair respectively given by the pairs (kwamap macsman-ual_rameau_lcsh) and (opiumfieldrating ) Also weobserve some lsquolsquonoisersquorsquo where for milestone concurrence class sizes(esp at 50 100 1000 2000) we observe an unusual amount ofclasses For example there were 72 thousand concurrence classesof precisely size 1000 (versus 88 concurrence classes at size996)mdashthe 1000 limit was due to a FOAF exporter from the hi5comdomain which seemingly enforces that limit on the total lsquolsquofriendscountrsquorsquo of users translating into many users with precisely 1000values for foafknows30 Also for example there were 55 thousandclasses of size 2000 (versus 6 classes of size 1999)mdashalmost all ofthese were due to an exporter from the bio2rdforg domain whichputs this limit on values for the b2rlinkedToFrom property31 Wealso encountered unusually large numbers of classes approximatingthese milestones such as 73 at 2001 Such phenomena explain thestaggered lsquolsquospikesrsquorsquoand lsquolsquodiscontinuitiesrsquorsquo in Fig 9b which can be ob-served to correlate with such milestone values (in fact similar butless noticeable spikes are also present in Fig 9a)

These figures allow us to choose a threshold of concurrence-class size given an upper bound on raw concurrence tuples togenerate For the purposes of evaluation we choose to keep thenumber of materialised concurrence tuples at around 1 billionwhich limits our maximum concurrence class size to 38 (fromwhich we produce 1024 billion tuples 721 million through shared(po) pairs and 303 million through (ps) pairs)

With respect to the statistics of predicates for the predicatendashsubject pairs each predicate had an average of 25229 unique ob-jects for 37953 total triples giving an average cardinality of15 We give the five predicates observed to have the lowest ad-

30 cf httpapihi5comrestprofilefoaf10061469731 cf httpbio2rdforgmeshD000123Q000235

justed average cardinality in Table 13 These predicates are judgedby the algorithm to be the most selective for identifying their sub-ject with a given object value (ie quasi-inverse-functional) notethat the bottom two predicates will not generate any concurrencescores since they are perfectly unique to a given object (ie thenumber of triples equals the number of objects such that no twoentities can share a value for this predicate) For the predicatendashob-ject pairs there was an average of 11572 subjects for 20532 tri-ples giving an average inverse-cardinality of 264 Weanalogously give the five predicates observed to have the lowestadjusted average inverse cardinality in Table 14 These predicatesare judged to be the most selective for identifying their object witha given subject value (ie quasi-functional) however four of thesepredicates will not generate any concurrences since they are per-fectly unique to a given subject (ie those where the number of tri-ples equals the number of subjects)

Aggregation produced a final total of 6369 million weightedconcurrence pairs with a mean concurrence weight of 00159Of these pairs 195 million involved a pair of identifiers from dif-ferent PLDs (31) whereas 6174 million involved identifiers fromthe same PLD however the average concurrence value for an in-tra-PLD pair was 0446 versus 0002 for inter-PLD pairsmdashalthough

Table 15Top five concurrent entities and the number of pairs they share

Entity label 1 Entity label 2 Concur

1 New York City New York State 7912 London England 8943 Tokyo Japan 9004 Toronto Ontario 4185 Philadelphia Pennsylvania 217

Table 16Breakdown of concurrences for top five ranked entities in SWSE ordered by rankwith respectively entity label number of concurrent entities found the label of theconcurrent entity with the largest degree and finally the degree value

Ranked entity Con lsquolsquoClosestrsquorsquo entity Val

1 Tim Berners-Lee 908 Lalana Kagal 0832 Dan Brickley 2552 Libby Miller 0943 updatestatusnet 11 socialnetworkro 0454 FOAF-a-matic 21 foafme 0235 Evan Prodromou 3367 Stav Prodromou 089

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 99

fewer intra-PLD concurrences are found they typically have higherconcurrences32

In Table 15 we give the labels of top five most concurrent enti-ties including the number of pairs they sharemdashthe concurrencescore for each of these pairs was gt09999999 We note that theyare all locations where particularly on WIKIPEDIA (and thus filteringthrough to DBpedia) properties with location values are typicallyduplicated (eg dbpdeathPlace dbpbirthPlace dbphead-quartersmdashproperties that are quasi-functional) for exampleNew York City and New York State are both the dbpdeath-

Place of dbpediaIsacc_Asimov etcIn Table 16 we give a description of the concurrent entities

found for the top-five ranked entitiesmdashfor brevity again we showentity labels In particular we note that a large amount of concur-rent entities are identified for the highly-ranked persons With re-spect to the strongest concurrences (i) Tim and his former studentLalana share twelve primarily academic links coauthoring six pa-pers (ii) Dan and Libby co-founders of the FOAF project share87 links primarily 73 foafknows relations to and from the samepeople as well as a co-authored paper occupying the same profes-sional positions etc33 (iii) updatestatusnet and socialnet-

workro share a single foafaccountServiceHomepage linkfrom a common user (iv) similarly the FOAF-a-matic and foafme

services share a single mvcbgeneratorAgent inlink (v) finallyEvan and Stav share 69 foafknows inlinks and outlinks exportedfrom the identica service

34 We instead refer the interested reader to [9] for some previous works on the topic

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data and thus this is anapproximative reading for the purposes of illustration we extract the PLDs fromcanonical identifiers which are choosen based on arbitrary lexical ordering

33 Notably Leigh Dodds (creator of the FOAF-a-matic service) is linked by theproperty quaffing drankBeerWith to both

of general inconsistency repair for Linked Data35 We use syntactic shortcuts in our file to denote when s = s0 andor o = o0

Maintaining the additional rewrite information during the consolidation process istrivial where the output of consolidating subjects gives quintuples (spocs0) whichare then sorted and consolidated by o to produce the given sextuples

36 httpmodelsokkamorgENS-core-vocabularycountry_of_residence37 httpontologydesignpatternsorgcpowlfsdas

RDF data representing formal contradictions according to OWLsemantics Where such contradictions are created due to consoli-dation we believe this to be a good indicator of erroneousconsolidation

Once erroneous equivalences have been detected we wouldlike to subsequently diagnose and repair the coreferences involvedOne option would be to completely disband the entire equivalenceclass however there may often be only one problematic identifierthat eg causes inconsistency in a large equivalence classmdashbreak-ing up all equivalences would be a coarse solution Instead hereinwe propose a more fine-grained method for repairing equivalenceclasses that incrementally rebuilds coreferences in the set in amanner that preserves consistency and that is based on the originalevidences for the equivalences found as well as concurrence scores(discussed in the previous section) to indicate how similar the ori-ginal entities are Once the erroneous equivalence classes havebeen repaired we can then revise the consolidated corpus to reflectthe changes

71 High-level approach

The high-level approach is to see if the consolidation of anyentities conducted in Section 5 lead to any novel inconsistenciesand subsequently recant the equivalences involved thus it isimportant to note that our aim is not to repair inconsistenciesin the data but instead to repair incorrect consolidation sympto-mised by inconsistency34 In this section we (i) describe whatforms of inconsistency we detect and how we detect them (ii)characterise how inconsistencies can be caused by consolidationusing examples from our corpus where possible (iii) discuss therepair of equivalence classes that have been determined to causeinconsistency

First in order to track which data are consolidated and whichnot in the previous consolidation step we output sextuples ofthe form

ethsp o c s0 o0THORN

where s p o c denote the consolidated quadruple containingcanonical identifiers in the subjectobject position as appropri-ate and s0 and o0 denote the input identifiers prior toconsolidation35

To detect inconsistencies in the consolidated corpus we use theOWL 2 RLRDF rules with the false consequent [24] as listed inTable 4 A quick check of our corpus revealed that one documentprovides eight owlAsymmetricProperty and 10 owlIrre-

flexiveProperty axioms36 and one directory gives 9 owlAll-

DisjointClasses axioms37 and where we found no other OWL2 axioms relevant to the rules in Table 4

We also consider an additional rule whose semantics are indi-rectly axiomatised by the OWL 2 RLRDF rules (through prp-fp dt-diff and eq-diff1) but which we must support directly since we donot consider consolidation of literals

p a owlFunctionalProperty

x p l1 l2

l1 owldifferentFrom l2

)false

100 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

where we underline the terminological pattern To illustrate wetake an example from our corpus

Terminological [httpdbpediaorgdata3

lengthrdf]

dpolength rdftype owlFunctionalProperty

Assertional [httpdbpediaorgdata

Fiat_Nuova_500xml]dbpediaFiat_Nuova_5000 dpolength3546

^^xsddouble

Assertional [httpdbpediaorgdata

Fiat_500xml]dbpediaFiat_Nuova0 dpolength 297

^^xsddouble

where we use the prime symbol [0] to denote identifiers consideredcoreferent by consolidation Here we see two very closely relatedmodels of cars consolidated in the previous step but we now iden-tify that they have two different values for dpolengthmdasha func-tional-propertymdashand thus the consolidation raises an inconsistency

Note that we do not expect the owldifferentFrom assertionto be materialised but instead intend a rather more relaxed seman-tics based on a heurisitic comparison given two (distinct) literalbindings for l1 and l2 we flag an inconsistency iff (i) the datavalues of the two bindings are not equal (standard OWL semantics)and (ii) their lower-case string value (minus language-tags and data-types) are lexically unequal In particular the relaxation is inspiredby the definition of the FOAF (datatype) functional propertiesfoafage foafgender and foafbirthday where the rangeof these properties is rather loosely defined a generic range ofrdfsLiteral is formally defined for these properties with infor-mal recommendations to use malefemale as gender values andMM-DD syntax for birthdays but not giving recommendations fordatatype or language-tags The latter relaxation means that wewould not flag an inconsistency in the following data

3

co

Terminological [httpxmlnscomfoafspecindexrdf]

9 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e f rn=aacute=ndezSergiohtml0 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e a nzuolaSergio_Fern=aacute=ndezhtml1

foafPerson owldisjointWith foafDocument

Assertional [fictional]

exTed foafage 25

exTed foafage lsquolsquo25rsquorsquoexTed foafgender lsquolsquomalersquorsquoexTed foafgender lsquolsquoMalersquorsquoenexTed foafbirthday lsquolsquo25-05rsquorsquo^^xsdgMonthDayexTed foafbirthday lsquolsquo25-05rsquorsquo

With respect to these consistency checking rules we considerthe terminological data to be sound38 We again only consider ter-minological axioms that are authoritatively served by their sourcefor example the following statement

sioc User owldisjointWith foaf Person

would have to be served by a document that either siocUser orfoafPerson dereferences to (either the FOAF or SIOC vocabularysince the axiom applies over a combination of FOAF and SIOC asser-tional data)

Given a grounding for such an inconsistency-detection rule wewish to analyse the constants bound by variables in join positionsto determine whether or not the contradiction is caused by consol-idation we are thus only interested in join variables that appear at

8 In any case we always source terminological data from the raw unconsolidatedrpus

least once in a consolidatable position (thus we do not supportdt-not-type) and where the join variable is lsquolsquointra-assertionalrsquorsquo(exists twice in the assertional patterns) Other forms of inconsis-tency must otherwise be present in the raw data

Further note that owlsameAs patternsmdashparticularly in rule eq-diff1mdashare implicit in the consolidated data eg consider

Assertional [httpwwwwikierorgfoafrdf]

wikierwikier0 owldifferentFrom

eswc2006psergio-fernandezrsquo

where an inconsistency is implicitly given by the owlsameAs rela-tion that holds between the consolidated identifiers wikierwiki-errsquo and eswc2006psergio-fernandezrsquo In this example thereare two Semantic Web researchers respectively named lsquolsquoSergioFernaacutendezrsquorsquo39 and lsquolsquoSergio Fernaacutendez Anzuolarsquorsquo40 who both partici-pated in the ESWC 2006 conference and who were subsequentlyconflated in the lsquolsquoDogFoodrsquorsquo export41 The former Sergio subsequentlyadded a counter-claim in his FOAF file asserting the above owldif-ferentFrom statement

Other inconsistencies do not involve explicit owlsameAs pat-terns a subset of which may require lsquolsquopositiversquorsquo reasoning to be de-tected eg

Terminological [httpxmlnscomfoafspec]

foafPerson owldisjointWith foafOrganization

foafknows rdfsdomain foafPerson

Assertional [httpidenticaw3cfoaf]

identica484040 foafknows identica45563

Assertional [inferred by prp-dom]

identica484040 a foafPerson

Assertional [httpwwwdatasemanticweborg

organizationw3crdf]

semweborgw3c0 a foafOrganization

where the two entities are initially consolidated due to sharing thevalue httpwwww3org for the inverse-functional propertyfoafhomepage the W3C is stated to be a foafOrganization

in one document and is inferred to be a person from its identicaprofile through rule prp-dom finally the W3C is a member of twodisjoint classes forming an inconsistency detectable by rule cax-dw42

Once the inconsistencies caused by consolidation have beenidentified we need to perform a repair of the equivalence class in-volved In order to resolve inconsistencies we make three simpli-fying assumptions

(i) the steps involved in the consolidation can be rederived withknowledge of direct inlinks and outlinks of the consolidatedentity or reasoned knowledge derived from there

(ii) inconsistencies are caused by pairs of consolidatedidentifiers

(iii) we repair individual equivalence classes and do not considerthe case where repairing one such class may indirectlyrepair another

httpwwwdatasemanticweborgdumpsconferenceseswc-2006-completerdf2 Note that this also could be viewed as a counter-example for using inconsisten-es to recant consolidation where arguably the two entities are coreferent from aractical perspective even if lsquolsquoincompatiblersquorsquo from a symbolic perspective

3

Fe4

A4

4

cip

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 101

With respect to the first item our current implementation willbe performing a repair of the equivalence class based on knowl-edge of direct inlinks and outlinks available through a simplemerge-join as used in the previous section this thus precludesrepair of consolidation found through rule cls-maxqc2 whichalso requires knowledge about the class memberships of the out-linked node With respect to the second item we say that incon-sistencies are caused by pairs of identifiersmdashwhat we termincompatible identifiersmdashsuch that we do not consider inconsisten-cies caused with respect to a single identifier (inconsistencies notcaused by consolidation) and do not consider the case where thealignment of more than two identifiers are required to cause asingle inconsistency (not possible in our rules) where such a casewould again lead to a disjunction of repair strategies With re-spect to the third item it is possible to resolve a set of inconsis-tent equivalence classes by repairing one for example considerrules with multiple lsquolsquointra-assertionalrsquorsquo join-variables (prp-irpprp-asyp) that can have explanations involving multiple consoli-dated identifiers as follows

Terminological [fictional]

made owlpropertyDisjointWith maker

Assertional [fictional]

exAZ00 maker exentcons0

dblpAntoine_Zimmermann00 made dblpHoganZUPD150

where both equivalences together constitute an inconsistencyRepairing one equivalence class would repair the inconsistencydetected for both we give no special treatment to such a caseand resolve each equivalence class independently In any casewe find no such incidences in our corpus these inconsistenciesrequire (i) axioms new in OWL 2 (rules prp-irp prp-asyp prp-pdw and prp-adp) (ii) alignment of two consolidated sets ofidentifiers in the subjectobject positions Note that such casescan also occur given the recursive nature of our consolidationwhereby consolidating one set of identifiers may lead to align-ments in the join positions of the consolidation rules in the nextiteration however we did not encounter such recursion duringthe consolidation phase for our data

The high-level approach to repairing inconsistent consolidationis as follows

(i) rederive and build a non-transitive symmetric graph ofequivalences between the identifiers in the equivalenceclass based on the inlinks and outlinks of the consolidatedentity

(ii) discover identifiers that together cause inconsistency andmust be separated generating a new seed equivalence classfor each and breaking the direct links between them

(iii) assign the remaining identifiers into one of the seed equiva-lence classes based on

43 We note the possibility of a dual correspondence between our lsquolsquobottom-uprsquoapproach to repair and the lsquolsquotop-downrsquorsquo minimal hitting set techniques introduced byReiter [55]

(a) minimum distance in the non-transitive equivalenceclass(b) if tied use a concurrence score

Given the simplifying assumptions we can formalise the prob-lem thus we denote the graph of non-transitive equivalences fora given equivalence class as a weighted graph G = (VEx) suchthat V B [ U is the set of vertices E B [ U B [ U is theset of edges and x E N R is a weighting function for theedges Our edge weights are pairs (dc) where d is the numberof sets of input triples in the corpus that allow to directly derivethe given equivalence relation by means of a direct owlsameAsassertion (in either direction) or a shared inverse-functional ob-ject or functional subjectmdashloosely the independent evidences for

the relation given by the input graph excluding transitiveowlsameAs semantics c is the concurrence score derivable be-tween the unconsolidated entities and is used to resolve ties(we would expect many strongly connected equivalence graphswhere eg the entire equivalence class is given by a singleshared value for a given inverse-functional property and we thusrequire the additional granularity of concurrence for repairing thedata in a non-trivial manner) We define a total lexicographicalorder over these pairs

Given an equivalence class Eq U [ B that we perceive to causea novel inconsistencymdashie an inconsistency derivable by the align-ment of incompatible identifiersmdashwe first derive a collection ofsets C = C1 Cn C 2U[B such that each Ci 2 C Ci Eq de-notes an unordered pair of incompatible identifiers

We then apply a straightforward greedy consistent clusteringof the equivalence class loosely following the notion of a mini-mal cutting (see eg [63]) For Eq we create an initial set of sin-gleton sets Eq0 each containing an individual identifier in theequivalence class Now let U(EiEj) denote the aggregated weightof the edge considering the merge of the nodes of Ei and thenodes of Ej in the graph the pair (dc) such that d denotes theunique evidences for equivalence relations between all nodes inEi and all nodes in Ej and such that c denotes the concurrencescore considering the merge of entities in Ei and Ejmdashintuitivelythe same weight as before but applied as if the identifiers in Ei

and Ej were consolidated in the graph We can apply the follow-ing clustering

ndash for each pair of sets EiEj 2 Eqn such that 9= ab 2 Ca 2 Eqib 2Eqj (ie consistently mergeable subsets) identify the weightsof U(EiEj) and order the pairings

ndash in descending order with respect to the above weights mergeEiEj pairsmdashsuch that neither Ei or Ej have already been mergedin this iterationmdashproducing En+1 at iterationrsquos end

ndash iterate over n until fixpoint

Thus we determine the pairs of incompatible identifiers thatmust necessarily be in different repaired equivalence classesdeconstruct the equivalence class and then begin reconstructingthe repaired equivalence class by iteratively merging the moststrongly linked intermediary equivalence classes that will not con-tain incompatible identifers43

72 Implementing disambiguation

The implementation of the above disambiguation process canbe viewed on two levels the macro level which identifies and col-lates the information about individual equivalence classes andtheir respectively consolidated inlinksoutlinks and the micro le-vel which repairs individual equivalence classes

On the macro level the task assumes input data sorted byboth subject (spocs0o0) and object (opsco0s0) again such thatso represent canonical identifiers and s0 o0 represent the originalidentifiers as before Note that we also require the assertedowlsameAs relations encoded likewise Given that all the re-quired information about the equivalence classes (their inlinksoutlinks derivable equivalences and original identifiers) are gath-ered under the canonical identifiers we can apply a straight-for-ward merge-join on s-o over the sorted stream of data and batchconsolidated segments of data

On a micro level we buffer each individual consolidated seg-ment into an in-memory index currently these segments fit in

rsquo

102 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

memory where for the largest equivalence classes we note thatinlinksoutlinks are commonly duplicatedmdashif this were not thecase one could consider using an on-disk index which shouldbe feasible given that only small batches of the corpus are underanalysis at each given time We assume access to the relevantterminological knowledge required for reasoning and the predi-cate-level statistics derived during from the concurrence analysisWe apply scan-reasoning and inconsistency detection over eachbatch and for efficiency skip over batches that are not sympto-mised by incompatible identifiers

For equivalence classes containing incompatible identifiers wefirst determine the full set of such pairs through application ofthe inconsistency detection rules usually each detection givesa single pair where we ignore pairs containing the same identi-fier (ie detections that would equally apply over the unconsol-idated data) We check the pairs for a trivial solution if allidentifiers in the equivalence class appear in some pair we checkwhether the graph formed by the pairs is strongly connected inwhich case the equivalence class must necessarily be completelydisbanded

For non-trivial repairs we extract the explicit owlsameAs rela-tions (which we view as directionless) and reinfer owlsameAs

relations from the consolidation rules encoding the subsequentgraph We label edges in the graph with a set of hashes denotingthe input triples required for their derivation such that the cardi-nality of the hashset corresponds to the primary edge weight Wesubsequently use a priority-queue to order the edge-weights andonly materialise concurrence scores in the case of a tie Nodes inthe equivalence graph are merged by combining unique edgesand merging the hashsets for overlapping edges Using these oper-ations we can apply the aformentioned process to derive the finalrepaired equivalence classes

In the final step we encode the repaired equivalence classesin memory and perform a final scan of the corpus (in naturalsorted order) revising identifiers according to their repairedcanonical term

73 Distributed implementation

Distribution of the task becomes straight-forward assumingthat the slave machines have knowledge of terminological datapredicate-level statistics and already have the consolidationencoding sextuples sorted and coordinated by hash on s and oNote that all of these data are present on the slave machines fromprevious tasks for the concurrence analysis we in fact maintainsextuples during the data preparation phase (although not re-quired by the analysis)

Thus we are left with two steps

Table 17Breakdown of timing of distributed disambiguation and repair

Category 1 2

Min Min

MasterTotal execution time 1147 1000 613 10Executing

Miscellaneous Idle (waiting for slaves) 1147 1000 612 10

SlaveAvg Executing (total exc idle) 1147 1000 590 9

Identify inconsistencies and repairs 747 651 385 6Repair Corpus 400 348 205 3

Avg Idle 23Waiting for peers 00 00 22Waiting for master

ndash Run each slave machine performs the above process on its seg-ment of the corpus applying a merge-join over the data sortedby (spocs0o0) and (opsco0s0) to derive batches of consoli-dated data which are subsequently analysed diagnosed anda repair derived in memory

ndash Gatherrun the master machine gathers all repair informationfrom all slave machines and floods the merged repairs to theslave machines the slave machines subsequently perform thefinal repair of the corpus

74 Performance evaluation

We ran the inconsistency detection and disambiguation overthe corpora produced by the extended consolidation approachFor the full corpus the total time taken was 235 h The inconsis-tency extraction and equivalence class repair analysis took 12 hwith a significant average idle time of 75 min (93) in particularcertain large batches of consolidated data took significant amountsof time to process particularly to reason over The additional ex-pense is due to the relaxation of duplicate detection we cannotconsider duplicates on a triple level but must consider uniquenessbased on the entire sextuple to derive the information required forrepair we must apply many duplicate inferencing steps On theother hand we can skip certain reasoning paths that cannot leadto inconsistency Repairing the corpus took 098 h with an averageidle time of 27 min

In Table 17 we again give a detailed breakdown of the timingsfor the task With regards the full corpus note that the aggregationof the repair information took a negligible amount of time andwhere only a total of one minute is spent on the slave machineMost notably load-balancing is somewhat of an issue causingslave machines to be idle for on average 72 of the total tasktime mostly waiting for peers This percentagemdashand the associatedload-balancing issuesmdashwould likely be aggrevated further by moremachines or a higher scale of data

With regards the timings of tasks when varying the number ofslave machines as the number of slave machines doubles the totalexecution times decrease by factors of 0534 0599 and0649 respectively Unlike for previous tasks where the aggrega-tion of global knowledge on the master machine poses a bottle-neck this time load-balancing for the inconsistency detectionand repair selection task sees overall performance start to convergewhen increasing machines

75 Results evaluation

It seems that our discussion of inconsistency repair has beensomewhat academic from the total of 282 million consolidated

4 8 Full-8

min Min Min

00 367 1000 238 1000 1411 1000 01 01 01 01 00 366 999 238 999 1411 1000

63 336 915 223 937 1309 92828 232 634 172 722 724 51335 103 281 51 215 585 41537 31 85 15 63 102 7237 31 84 15 62 101 72 01 01

Table 20Results of manual repair inspection

Manual Auto Count

SAME frasl 223 443DIFFERENT frasl 261 519UNCLEAR ndash 19 38frasl SAME 361 746frasl DIFFERENT 123 254SAME SAME 205 919SAME DIFFERENT 18 81DIFFERENT DIFFERENT 105 402DIFFERENT SAME 156 597

Table 18Breakdown of inconsistency detections for functional-properties where properties inthe same row gave identical detections

Functional property Detections

1 foafgender 602 foafage 323 dboheight dbolength 44 dbowheelbase dbowidth 35 atomowlbody 16 locaddress locname locphone 1

Table 19Breakdown of inconsistency detections for disjoint-classes

Disjoint Class 1 Disjoint Class 2 Detections

1 foafDocument foafPerson 1632 foafDocument foafAgent 1533 foafOrganization foafPerson 174 foafPerson foafProject 35 foafOrganization foafProject 1

1

10

100

1000

1 10 100

num

ber o

f rep

aire

d eq

uiva

lenc

e cl

asse

s

number of partitions

Fig 10 Distribution of the number of partitions the inconsistent equivalenceclasses are broken into (loglog)

1

10

100

1000

1 10 100 1000

num

ber o

f par

titio

ns

number of IDs in partition

Fig 11 Distribution of the number of identifiers per repaired partition (loglog)

44 We were particularly careful to distinguish information resourcesmdashsuch asWIKIPEDIA articlesmdashand non-information resources as being DIFFERENT

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 103

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top functional properties given distinct literal values in Table 18, and the top five disjoint-class pairs in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the two identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different: here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
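These figures can be traced directly from the confusion counts in Table 20; the following minimal Python sketch (with the counts hard-coded from the table) reproduces the reported values to within rounding.

# confusion counts from Table 20 (manual annotation vs. automatic repair)
same_same = 205   # manually SAME, kept equivalent by the repair
same_diff = 18    # manually SAME, separated by the repair
diff_diff = 105   # manually DIFFERENT, separated by the repair
diff_same = 156   # manually DIFFERENT, kept equivalent by the repair

def prf(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives and false negatives."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

print(prf(same_same, diff_same, same_diff))  # "kept equivalent" as positive class: ~(0.568, 0.919, 0.702)
print(prf(diff_diff, same_diff, diff_same))  # "kept separate" as positive class:  ~(0.854, 0.402, 0.546)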

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but which were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and which were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being coreferent through a web of shared nonsense values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help not only to automatically identify which resources are the same, but also to identify which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person/Document) which have wide catchments are useful for detecting inconsistencies in the first place, but are not ideal for performing granular repairs.
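To make the final suggestion concrete: a datatype property that is both functional and inverse-functional acts as a key, so a shared value is positive evidence that two identifiers are coreferent, while disjoint value sets are granular negative evidence that they are not. The following Python sketch uses purely hypothetical identifiers and isbn values to illustrate how both signals could be read off such a property.

from collections import defaultdict

# hypothetical (identifier, isbn) observations from a corpus
observations = [
    ("ex:bookA", "978-0134092669"),
    ("dbp:Some_Book", "978-0134092669"),
    ("ex:bookB", "978-0262533058"),
]

by_value = defaultdict(set)   # isbn -> identifiers carrying that value
by_id = defaultdict(set)      # identifier -> isbn values it carries
for ident, value in observations:
    by_value[value].add(ident)
    by_id[ident].add(value)

# inverse-functional reading: a shared value implies coreference
same_sets = [ids for ids in by_value.values() if len(ids) > 1]

# functional (one-to-one) reading: identifiers whose value sets are disjoint
# cannot be coreferent, giving fine-grained evidence for repairing a partition
different_pairs = [(a, b) for a in by_id for b in by_id
                   if a < b and by_id[a].isdisjoint(by_id[b])]

print(same_sets)        # [{'ex:bookA', 'dbp:Some_Book'}]
print(different_pairs)  # the pairs involving ex:bookB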

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts: characterised by the linearithmic time complexity O(n log(n)), external sorts do not have a critical main-memory requirement and are efficiently distributable (a minimal sketch of such an external sort follows the list below). Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].
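To illustrate the primitive on which most of the above tasks rely, the following minimal Python sketch (not our actual implementation, which operates over compressed flat files) sorts an arbitrarily large stream of newline-terminated statements with bounded memory, by flushing sorted runs to disk and lazily merging them.

import heapq, os, tempfile

def _flush(sorted_lines):
    # write one sorted run to a temporary file and return its path
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted_lines)
    return path

def external_sort(lines, run_size=1_000_000):
    """Sort an iterator of newline-terminated lines using O(run_size) memory."""
    runs, buf = [], []
    for line in lines:
        buf.append(line)
        if len(buf) >= run_size:
            runs.append(_flush(sorted(buf)))
            buf = []
    if buf:
        runs.append(_flush(sorted(buf)))
    files = [open(p) for p in runs]
    try:
        yield from heapq.merge(*files)   # k-way merge over the sorted runs
    finally:
        for f in files:
            f.close()
        for p in runs:
            os.remove(p)

# e.g., sorting quads by subject so that all statements describing an entity
# become adjacent and can be processed in a single linear scan:
#   for quad in external_sort(open("corpus.nq")): ...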

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such aggregation would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object positions, where we do not hash on the objects of rdf:type triples (a sketch of this placement function is given after this list); although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.
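The placement function referred to above can be sketched as follows; the code is illustrative only (the names and the choice of hash function are ours, not those of our implementation), but shows how a statement is replicated to the machines responsible for its subject and, except for rdf:type statements, its object.

import hashlib

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def stable_hash(term):
    # process-independent hash (Python's built-in hash() is salted per run)
    return int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")

def placement(subj, pred, obj, num_machines):
    """Machines that receive this statement: split by subject and (except for
    rdf:type, whose objects are ubiquitous class terms) by object, so that the
    data surrounding a given entity is co-located."""
    machines = {stable_hash(subj) % num_machines}
    if pred != RDF_TYPE:
        machines.add(stable_hash(obj) % num_machines)
    return machines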

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems[,] where the data model does not provide primitives for making same-as assertions [...], there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty; [thus,] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name of instance matching; for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance-matching systems. It is worth noting that, in contrast to our scenario, many instance-matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance-matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/


The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to that presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance-matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string-similarity measures and class-based machine-learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and an assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string-matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine-learning techniques to consolidate FOAF personal-profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; and Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (especially scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to ours, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning over owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property-combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style, string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
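The canonicalisation idea referred to here, used both by such systems and in our own approach, can be illustrated generically with a union-find structure: each owl:sameAs statement merges two sets of identifiers, and every identifier is afterwards rewritten to a single canonical representative, rather than materialising all pairwise owl:sameAs statements. The Python sketch below is a generic illustration only, not the implementation of BigOWLIM or of our system.

class UnionFind:
    """Union-find over identifiers; each set's representative later serves as
    the canonical identifier for all of its members."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx

uf = UnionFind()
for s, o in [("ex:timbl", "ex:TimBL"), ("ex:TimBL", "dbpedia:Tim_Berners-Lee")]:
    uf.union(s, o)                       # one union per owl:sameAs statement

print(uf.find("dbpedia:Tim_Berners-Lee"))  # all three identifiers map to one representative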

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris


Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance-matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, and are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into nearby regions of the hash space, such that distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
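The general idea can be illustrated independently of the RDFSim internals: each resource's virtual document is reduced to a set of tokens, a MinHash signature approximates the Jaccard coefficient between token sets, and resources whose signatures (or signature bands) collide become candidates for more expensive comparison. A minimal Python sketch over two hypothetical virtual documents:

import hashlib

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def minhash(toks, num_hashes=64):
    # for each salted hash function, keep the minimum hash over the token set;
    # the fraction of matching signature positions estimates the Jaccard coefficient
    return [min(int.from_bytes(hashlib.md5(f"{i}:{t}".encode()).digest()[:8], "big")
                for t in toks)
            for i in range(num_hashes)]

doc_a = tokens("tim berners lee w3c director semantic web")
doc_b = tokens("sir tim berners lee inventor of the web w3c")
sig_a, sig_b = minhash(doc_a), minhash(doc_b)
print(jaccard(doc_a, doc_b))                                   # exact Jaccard
print(sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a))  # MinHash estimate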

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage the use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store: the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity


Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found contained 5 thousand identifiers (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion.

We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we noted that many of the entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data-integration techniques will become of vital importance, particularly for data-warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.



Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission, 2008. http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: http://www.w3.org/DesignIssues/LinkedData.html
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial, 2008. Available from: http://linkeddata.org/docs/how-to-publish
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OKKAM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings, 2006.




[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop, 2008.
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009, 2009.
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft, 2008. http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1), 2010, pp. 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010), 2010.
[28] Patrick Hayes, RDF Semantics, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: http://aidanhogan.com/docs/thesis/
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available from: http://www.w3.org/TR/owl-features/
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/. Superseded by Berrueta and Phipps: http://www.w3.org/TR/swbp-vocab-pub/
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), 2009.
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop, 2008.
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR abs/1005.4298 (2010).
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC, 2010.
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference, 2011.
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR, 2007, pp. 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009), 2009.
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference, 2009, pp. 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


related datasets (e.g., [48,58,49]), thus focusing more on theoretical considerations than scalability, usually combining symbolic (e.g., reasoning with consistency checking) methods and similarity measures. Some authors have looked at inter-linkage of domain-specific RDF datasets at various degrees of scale (e.g., [61,59,38,54,46]). Further research has also looked at exploiting shared terminological data, as well as explicitly asserted links, to better integrate Linked Data collected from thousands or millions of sources (e.g., [30,50,15,35]); the work presented herein falls most closely into this category. One approach has tackled the problem from the publisher's side, detailing a system for manually specifying some (possibly heuristic) criteria for creating links between two datasets [72]. We leave further detailed related work to Section 9.

In this paper, we look at methods to provide better linkage between resources, in particular focusing on finding equivalent entities in the data. Our notion of an entity is a representation of something being described by the data, e.g., a person, a place, a musician, a protein, etc. We say that two entities are equivalent if they are coreferent, e.g., refer to the same person, place, etc.2 Given a collection of datasets that speak about the same referents using different identifiers, we wish to identify these coreferences and somehow merge the knowledge contributions provided by the distinct parties. We call this merge consolidation.

In particular, our work is inspired by the requirements of the Semantic Web Search Engine (SWSE) project [32], within which we aim to offer search and browsing over large, static Linked Data corpora crawled from the Web.3 The core operation of SWSE is to take user keyword queries as input and to generate a ranked list of matching entities as results. After the core components of a crawler, index and user-interface, we saw a clear need for a component that consolidates (by means of identifying and canonicalising equivalent identifiers) the indexed corpus: there was an observable lack of URIs, such that coreferent blank-nodes were prevalent [30], even within the same dataset, and thus we observed many duplicate results referring to the same thing, leading to poor integration of data from our source documents.

To take a brief example, consider the simple query "WHO DOES TIM BERNERS-LEE KNOW?". Knowing that Tim uses the URI timblfoaf:i to refer to himself in his personal FOAF profile document, and again knowing that the property foaf:knows relates people to their (reciprocated) acquaintances, we can formulate this request as the following SPARQL query [53]:

SELECT ?person
WHERE { timblfoaf:i foaf:knows ?person }

However, other publishers use different URIs to identify Tim, where, to get more complete answers across these naming schemes, the SPARQL query must use disjunctive UNION clauses for each known URI; here we give an example using a sample of identifiers extracted from a real Linked Data corpus (introduced later):

2 Herein, we avoid philosophical discussion on the notion of identity; for interesting discussion thereon, see [26].

3 By static, we mean that the system does not cater for updates; this omission allows for optimisations throughout the system. Instead, we aim at a cyclical indexing paradigm, where new indexes are bulk-loaded in the background on separate machines.

SELECT ?person
WHERE {
  { timblfoaf:i foaf:knows ?person }
  UNION { identicau:45563 foaf:knows ?person }
  UNION { dbpedia:Berners-Lee foaf:knows ?person }
  UNION { dbpedia:Dr_Tim_Berners-Lee foaf:knows ?person }
  UNION { dbpedia:Tim-Berners_Lee foaf:knows ?person }
  UNION { dbpedia:TimBL foaf:knows ?person }
  UNION { dbpedia:Tim_Berners-Lee foaf:knows ?person }
  UNION { dbpedia:Tim_berners-lee foaf:knows ?person }
  UNION { dbpedia:Timbl foaf:knows ?person }
  UNION { dbpedia:Timothy_Berners-Lee foaf:knows ?person }
  UNION { yagor:Tim_Berners-Lee foaf:knows ?person }
  UNION { fb:en.tim_berners-lee foaf:knows ?person }
  UNION { swid:Tim_Berners-Lee foaf:knows ?person }
  UNION { dblpperson:100007 foaf:knows ?person }
  UNION { avtimbl:me foaf:knows ?person }
  UNION { bmpersons:Tim+Berners-Lee foaf:knows ?person }
}

We see disparate URIs not only across data publishers, but also within the same namespace. Clearly, the expanded query quickly becomes extremely cumbersome.

In this paper, we look at bespoke methods for identifying and processing coreference in a manner such that the resultant corpus can be consumed as if more complete agreement on URIs were present; in other words, using standard query-answering techniques, we want the enhanced corpus to return the same answers for the original simple query as for the latter expanded query.

Our core requirements for the consolidation component are as follows:

– the component must give high precision of consolidated results;
– the underlying algorithm(s) must be scalable;
– the approach must be fully automatic;
– the methods must be domain agnostic;

where a component with poor precision will lead to garbled fi-nal results merging unrelated entities where scalability is requiredto apply the process over our corpora typically in the order of a bil-lion statements (and which we feasibly hope to expand in future)where the scale of the corpora under analysis precludes any man-ual intervention and wheremdashfor the purposes of researchmdashthemethods should not give preferential treatment to any domain orvocabulary of data (other than core RDF(S)OWL terms) Alongsidethese primary requirements we also identify the following second-ary criteria

– the analysis should demonstrate high recall;
– the underlying algorithm(s) should be efficient;

where the consolidation component should identify as many (correct) equivalences as possible, and where the algorithm should be applicable in reasonable time. Clearly, the secondary requirements are also important, but they are superseded by those given earlier: where a certain trade-off exists, we prefer a system that gives a high percentage of correct results and leads to a clean consolidated corpus, over an approach that gives a higher percentage of consolidated results but leads to a partially garbled corpus; similarly, we prefer a system that can handle more data (is more scalable) but may possibly have a lower throughput (is less efficient).4

Thus, herein we revisit methods for scalable, precise, automatic and domain-agnostic entity consolidation over large, static Linked Data corpora. In order to make our methods scalable, we avoid dynamic on-disk index structures and instead opt for algorithms that rely on sequential on-disk reads/writes of compressed flat files, using operations such as scans, external sorts, merge-joins, and only light-weight or non-critical in-memory indices. In order to make our methods efficient, we demonstrate distributed implementations of our methods over a cluster of shared-nothing commodity hardware, where our algorithms attempt to maximise the portion of time spent in embarrassingly parallel execution, i.e., parallel, independent computation without the need for inter-machine coordination. In order to make our methods domain-agnostic and fully-automatic, we exploit the generic, formal semantics of the data described in RDF(S)/OWL, and also generic statistics derivable from the corpus. In order to achieve high recall, we incorporate additional OWL features to find novel coreference through reasoning. Aiming at high precision, we introduce methods that again exploit the semantics and also the statistics of the data, but to conversely disambiguate entities: to defeat equivalences found in the previous step that are unlikely to be true according to some criteria.

As such, extending upon various reasoning and consolidation techniques described in previous works [30,31,33,34], we now give a self-contained treatment of our results in this area. In addition, we distribute the execution of all methods over a cluster of commodity hardware, we provide novel, detailed performance and quality evaluation over a large, real-world Linked Data corpus of one billion statements, and we also examine exploratory techniques for interlinking similar entities based on statistical analyses, as well as methods for disambiguating and repairing incorrect coreferences.

In summary, in this paper we:

– provide some necessary preliminaries and describe our distributed architecture (Section 2);
– characterise the 1 billion quadruple Linked Data corpus that will be used for later evaluation of our methods, in particular focusing on the (re-)use of data-level identifiers in the corpus (Section 3);
– describe and evaluate our distributed base-line approach for consolidation, which leverages explicit owl:sameAs relations (Section 4);
– describe and evaluate a distributed approach that extends consolidation to consider a richer OWL semantics for consolidating (Section 5);
– present a distributed algorithm for determining weighted concurrence between entities using statistical analysis of predicates in the corpus (Section 6);
– present a distributed approach to disambiguate entities, i.e., detect likely erroneous consolidation, combining the semantics and statistics derivable from the corpus (Section 7);
– provide critical discussion (Section 8), render related work (Section 9), and conclude with discussion (Section 10).

4 Of course, entity consolidation has many practical applications outside of our referential use-case (SWSE) and is useful in many generic query-answering scenarios; we see these requirements as being somewhat fundamental to a consolidation component, to varying degrees.

2. Preliminaries

In this section, we provide some necessary preliminaries relating to: (i) RDF, the structured format used in our corpora (Section 2.1); (ii) RDFS/OWL, which provides formal semantics to RDF data, including the semantics of equality (Section 2.2); (iii) rules, which are used to interpret the semantics of RDF and apply reasoning (Sections 2.3 and 2.4); (iv) authoritative reasoning, which critically examines the source of certain types of Linked Data to help ensure robustness of reasoning (Section 2.5); and (v) OWL 2 RL/RDF rules, a standard rule-based profile for supporting OWL semantics, a subset of which we use to deduce equality (Section 2.6). Throughout, we attempt to preserve notation and terminology as prevalent in the literature.

2.1. RDF

We briefly give some necessary notation relating to RDF constants and RDF triples; see [28].

RDF constant. Given the set of URI references U, the set of blank nodes B, and the set of literals L, the set of RDF constants is denoted by C = U ∪ B ∪ L. As opposed to the RDF-mandated existential semantics for blank-nodes, we interpret blank-nodes as ground Skolem constants; note also that we rewrite blank-node labels to ensure uniqueness per document, as per RDF merging [28].

RDF triple. A triple t = (s, p, o) ∈ 𝒢 := (U ∪ B) × U × C is called an RDF triple, where s is called subject, p predicate and o object. We call a finite set of triples G ⊂ 𝒢 a graph. This notion of a triple restricts the positions in which certain terms may appear. We use the phrase generalised triple where such restrictions are relaxed, and where literals and blank-nodes are allowed to appear in the subject/predicate positions.

RDF triple in context/quadruple. Given c ∈ U, let http(c) denote the possibly empty graph G_c given by retrieving the URI c through HTTP (directly returning 200 Okay). An ordered pair (t, c), with an RDF triple t = (s, p, o), c ∈ U and t ∈ http(c), is called a triple in context c. We may also refer to (s, p, o, c) as an RDF quadruple or quad q with context c.
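
To make these definitions concrete, the following is a minimal sketch (in Java, and not taken from our code-base) of how RDF constants, triples and quads could be modelled; all class and field names are illustrative assumptions rather than the system's actual data structures.

import java.util.Objects;

// Illustrative model of RDF constants (C = U ∪ B ∪ L), triples and quads.
abstract class RDFConstant {
  final String value;
  RDFConstant(String value) { this.value = Objects.requireNonNull(value); }
  @Override public String toString() { return value; }
}
final class URIRef    extends RDFConstant { URIRef(String v)    { super(v); } } // u ∈ U
final class BlankNode extends RDFConstant { BlankNode(String v) { super(v); } } // b ∈ B (Skolemised, per-document labels)
final class Literal   extends RDFConstant { Literal(String v)   { super(v); } } // l ∈ L

// A generalised triple (s, p, o); the standard positional restrictions would be
// enforced by a validating factory rather than by the types themselves.
record Triple(RDFConstant subject, RDFConstant predicate, RDFConstant object) { }

// A triple paired with its context: the URI of the document it was retrieved from.
record Quad(Triple triple, URIRef context) { }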

2.2. RDFS, OWL and owl:sameAs

Conceptually, RDF data is composed of assertional data (i.e., instance data) and terminological data (i.e., schema data). Assertional data define relationships between individuals, provide literal-valued attributes of those individuals, and declare individuals as members of classes. Thereafter, terminological data describe those classes, relationships (object properties) and attributes (datatype properties), and declaratively assign them a semantics.5 With well-defined descriptions of classes and properties, reasoning can then allow for deriving new knowledge, including over assertional data.

On the Semantic Web, the RDFS [11] and OWL [42] standards are prominently used for making statements with well-defined meaning and enabling such reasoning. Although primarily concerned with describing terminological knowledge (such as subsumption or equivalence between classes and properties), RDFS and OWL also contain features that operate on an assertional level. The most interesting such feature for our purposes is owl:sameAs: a core OWL property that allows for defining equivalences between individuals (as well as classes and properties). Two individuals related by means of owl:sameAs are interpreted as referring to the same entity, i.e., they are coreferent.

5 Terminological and assertional data are not by any means necessarily disjoint. Furthermore, this distinction is not always required, but is useful for our purposes herein.


Interestingly, OWL also contains other features (on a terminological level) that allow for inferring new owl:sameAs relations. Such features include:

– inverse-functional properties, which act as "key properties" for uniquely identifying subjects; e.g., an isbn datatype-property whose values uniquely identify books and other documents, or the object-property biologicalMotherOf, which uniquely identifies the biological mother of a particular child;
– functional properties, which work in the inverse direction and uniquely identify objects; e.g., the property hasBiologicalMother;
– cardinality and max-cardinality restrictions, which, when given a value of 1, uniquely identify subjects of a given class; e.g., the value for the property spouse uniquely identifies a member of the class Monogamist.

Thus, we see that RDFS and OWL contain various features that allow for inferring and supporting coreference between individuals. In order to support the semantics of such features, we apply inference rules, introduced next.

2.3. Inference rules

Herein, we formalise the notion of an inference rule as applicable over RDF graphs [33]. We begin by defining the preliminary concept of a triple pattern, of which rules are composed, and which may contain variables in any position.

Triple pattern, basic graph pattern. A triple pattern is a generalised triple where variables from the set V are allowed; i.e., t_V = (s_V, p_V, o_V) ∈ 𝒢_V, where 𝒢_V := (C ∪ V) × (C ∪ V) × (C ∪ V). We call a set (to be read as a conjunction) of triple patterns G_V ⊂ 𝒢_V a basic graph pattern. We denote the set of variables in a graph pattern G_V by V(G_V).

Intuitively, variables appearing in triple patterns can be bound by any RDF term in C. Such a mapping from variables to constants is called a variable binding.

Variable bindings. Let Ω be the set of variable bindings V ∪ C → V ∪ C that map every constant c ∈ C to itself and every variable v ∈ V to an element of the set C ∪ V. A triple t is a binding of a triple pattern t_V = (s_V, p_V, o_V) iff there exists μ ∈ Ω such that t = μ(t_V) = (μ(s_V), μ(p_V), μ(o_V)). A graph G is a binding of a graph pattern G_V iff there exists a mapping μ ∈ Ω such that

  ∪_{t_V ∈ G_V} μ(t_V) = G;

we use the shorthand μ(G_V) = G. We use Ω(G_V, G) := {μ | μ(G_V) ⊆ G, μ(v) = v if v ∉ V(G_V)} to denote the set of variable binding mappings for graph pattern G_V in graph G that map variables outside G_V to themselves.

We can now give the notion of an inference rule, which is comprised of a pair of basic graph patterns that form an "IF ⇒ THEN" logical structure.

Inference rule. We define an inference rule (or often just rule) as a pair r = (Ante_r, Con_r), where the antecedent (or body) Ante_r ⊂ 𝒢_V and the consequent (or head) Con_r ⊂ 𝒢_V are basic graph patterns such that all variables in the consequent appear in the antecedent: V(Con_r) ⊆ V(Ante_r). We write inference rules as Ante_r ⇒ Con_r.

Finally, rules allow for applying inference through rule applications, where the rule body is used to generate variable bindings against a given RDF graph, and where those bindings are applied on the head to generate new triples that the graph entails.

Rule application and standard closure. A rule application is the set of immediate consequences

  T_r(G) := ∪_{μ ∈ Ω(Ante_r, G)} (μ(Con_r) \ μ(Ante_r))

of a rule r on a graph G; accordingly, for a ruleset R, T_R(G) := ∪_{r ∈ R} T_r(G). Now let G_{i+1} := G_i ∪ T_R(G_i) and G_0 := G; the exhaustive application of the T_R operator on a graph G is then the least fixpoint (the smallest value for n) such that G_n = G_n ∪ T_R(G_n). We call G_n the closure of G with respect to ruleset R, denoted Cl_R(G), or succinctly the closure of G where the ruleset is obvious.

The above closure takes a graph and a ruleset, and recursively applies the rules over the union of the original graph and the inferences, until a fixpoint is reached.
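
As an illustration of the closure just defined, the following sketch (hypothetical; not our SAOR implementation) applies a ruleset to a graph until a fixpoint is reached. Here a Rule is assumed to expose an apply method returning its immediate consequences over the current graph, and triples are left as a generic type.

import java.util.HashSet;
import java.util.Set;

// A rule maps a graph to its immediate consequences T_r(G).
interface Rule<T> {
  Set<T> apply(Set<T> graph);
}

final class Closure {
  // Compute Cl_R(G): repeatedly add T_R(G_i) to G_i until no new triples are produced.
  static <T> Set<T> close(Set<T> graph, Iterable<Rule<T>> ruleset) {
    Set<T> closed = new HashSet<>(graph);
    boolean changed = true;
    while (changed) {
      changed = false;
      for (Rule<T> rule : ruleset) {
        if (closed.addAll(rule.apply(closed))) {   // new inferences found this pass
          changed = true;
        }
      }
    }
    return closed;
  }
}

Such naive in-memory materialisation is only feasible for small graphs; the batch-processing and distributed methods described later deliberately avoid holding the full corpus in memory.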

2.4. T-split rules

In [31,33], we formalised an optimisation based on a separation of terminological knowledge from assertional data when applying rules. Similar optimisations have also been used by other authors to enable large-scale rule-based inferencing [74,71,43]. This optimisation is based on the premise that, in Linked Data (and various other scenarios), terminological data speaking about classes and properties is much smaller than assertional data speaking about individuals. Previous observations indicate that, for a Linked Data corpus in the order of a billion triples, such terminological data comprises about 0.1% of the total data volume. Additionally, terminological data is very frequently accessed during reasoning. Hence, separating and storing the terminological data in memory allows for more efficient execution of rules, albeit at the cost of possible incompleteness [74,33].

We now reintroduce some of the concepts relating to separating terminological information [33], which we will use later when applying rule-based reasoning to support the semantics of OWL equality. We begin by formalising some related concepts that help define our notion of terminological information.

Meta-class. We consider a meta-class as a class whose members are themselves (always) classes or properties. Herein, we restrict our notion of meta-classes to the set defined in the RDF(S) and OWL specifications, where examples include rdf:Property, rdfs:Class, owl:DatatypeProperty, owl:FunctionalProperty, etc. Note that, e.g., owl:Thing and rdfs:Resource are not meta-classes, since not all of their members are classes or properties.

Meta-property. A meta-property is one that has a meta-class as its domain; again, we restrict our notion of meta-properties to the set defined in the RDF(S) and OWL specifications, where examples include rdfs:domain, rdfs:range, rdfs:subClassOf, owl:hasKey, owl:inverseOf, owl:oneOf, owl:onProperty, owl:unionOf, etc. Note that rdf:type, owl:sameAs, rdfs:label, e.g., do not have a meta-class as domain, and so are not considered as meta-properties.

Terminological triple. We define the set of terminological triples T ⊆ G as the union of:

(i) triples with rdf:type as predicate and a meta-class as object;
(ii) triples with a meta-property as predicate;
(iii) triples forming a valid RDF list whose head is the object of a meta-property (e.g., a list used for owl:unionOf, etc.).

The above concepts cover what we mean by terminological data in the context of RDFS and OWL. Next, we define the notion of triple patterns that distinguish between these different categories of data.

Terminological/assertional pattern. We refer to a terminological-triple/graph pattern as one whose instance can only be a terminological triple or, resp., a set thereof. An assertional pattern is any pattern that is not terminological.
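
For illustration, conditions (i) and (ii) of the terminological-triple definition can be tested against the fixed RDF(S)/OWL meta-vocabulary; the sketch below (hypothetical names, and only a small sample of meta-classes and meta-properties) shows such a check. Condition (iii) is omitted since it requires the surrounding graph to follow RDF collections.

import java.util.Set;

// Hypothetical check for conditions (i) and (ii) of the terminological-triple definition.
final class Terminological {
  static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

  // Small illustrative subsets of the meta-classes/meta-properties from RDF(S) and OWL.
  static final Set<String> META_CLASSES = Set.of(
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property",
      "http://www.w3.org/2000/01/rdf-schema#Class",
      "http://www.w3.org/2002/07/owl#FunctionalProperty",
      "http://www.w3.org/2002/07/owl#InverseFunctionalProperty");
  static final Set<String> META_PROPERTIES = Set.of(
      "http://www.w3.org/2000/01/rdf-schema#domain",
      "http://www.w3.org/2000/01/rdf-schema#range",
      "http://www.w3.org/2000/01/rdf-schema#subClassOf",
      "http://www.w3.org/2002/07/owl#onProperty",
      "http://www.w3.org/2002/07/owl#maxCardinality");

  static boolean isTerminological(String subject, String predicate, String object) {
    return (RDF_TYPE.equals(predicate) && META_CLASSES.contains(object))
        || META_PROPERTIES.contains(predicate);
  }
}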

Given the above notions of terminological data/patterns, we can now define the notion of a T-split inference rule that distinguishes terminological from assertional information.

T-split inference rule. Given a rule r = (Ante_r, Con_r), we define a T-split rule r_s as the triple (Ante_rs^T, Ante_rs^G, Con_r), where Ante_rs^T is the set of terminological patterns in Ante_r and Ante_rs^G := Ante_r \ Ante_rs^T.

The body of T-split rules is thus divided into a set of patterns that apply over terminological knowledge and a set of patterns that apply over assertional (i.e., any) knowledge. Such rules enable an optimisation whereby terminological patterns are pre-bound to generate a new, larger set of purely assertional rules, called T-ground rules. We illustrate this with an example.

Example. Let R_prp-ifp denote the following rule:

(?p, a, owl:InverseFunctionalProperty)(?x1, ?p, ?y)(?x2, ?p, ?y) ⇒ (?x1, owl:sameAs, ?x2)

When writing T-split rules, we denote terminological patterns in the body by underlining. Also, we use 'a' as a convenient shortcut for rdf:type, which indicates class-membership. Now take the terminological triples:

(isbn, a, owl:InverseFunctionalProperty)
(mbox, a, owl:InverseFunctionalProperty)

From these two triples, we can generate two T-ground rules as follows:

(?x1, isbn, ?y)(?x2, isbn, ?y) ⇒ (?x1, owl:sameAs, ?x2)
(?x1, mbox, ?y)(?x2, mbox, ?y) ⇒ (?x1, owl:sameAs, ?x2)

These T-ground rules encode the terminological OWL knowledge given by the previous two triples, and can be applied directly over the assertional data, e.g., describing specific books with ISBN numbers.
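
The T-grounding step itself can be sketched as follows (hypothetical names, string-based triples): every property declared as an owl:InverseFunctionalProperty in the in-memory terminological data yields one purely assertional rule of the form shown above, keyed on that property.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: generate T-ground (assertional-only) rules for prp-ifp by
// pre-binding the terminological pattern (?p a owl:InverseFunctionalProperty).
final class TGrounding {
  record Triple(String s, String p, String o) { }

  // A T-ground rule (?x1 p ?y)(?x2 p ?y) => (?x1 owl:sameAs ?x2),
  // represented simply by the property p it is keyed on.
  record IFPGroundRule(String property) { }

  static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
  static final String OWL_IFP = "http://www.w3.org/2002/07/owl#InverseFunctionalProperty";

  static List<IFPGroundRule> ground(List<Triple> terminologicalTriples) {
    List<IFPGroundRule> rules = new ArrayList<>();
    for (Triple t : terminologicalTriples) {
      if (RDF_TYPE.equals(t.p()) && OWL_IFP.equals(t.o())) {
        rules.add(new IFPGroundRule(t.s()));   // e.g., isbn, mbox
      }
    }
    return rules;
  }
}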

2.5. Authoritative T-split rules

Caution must be exercised when applying reasoning over arbitrary data collected from the Web, e.g., Linked Data. In previous works, we have encountered various problems when naïvely performing rule-based reasoning over arbitrary Linked Data [31,33]. In particular, third-party claims made about popular classes and properties must be critically analysed to ensure their trustworthiness. As a single example of the type of claims that can cause issues with reasoning, we found one document that defines nine properties as the domain of rdf:type;6 thereafter, the semantics of rdfs:domain mandate that everything which is a member of any class (i.e., almost every known resource) can be inferred to be a "member" of each of the nine properties. We found that naïve reasoning led to 200% more inferences than would be expected when considering the definitions of classes and properties as defined in their "namespace documents". Thus, in the general case, performing rule-based materialisation with respect to OWL semantics over arbitrary Linked Data requires some critical analysis of the source of data, as per our authoritative reasoning algorithm. Please see [9] for a detailed analysis of the explosion in inferences that occurs when authoritative analysis is not applied.

Such observations prompted us to investigate more robust forms of reasoning. Along these lines, when reasoning over Linked Data, we apply an algorithm called authoritative reasoning [14,31,33], which critically examines the source of terminological triples and conservatively rejects those that cannot be definitively trusted, i.e., those that redefine classes and/or properties outside of the namespace document.

6 viz. http://www.eiao.net/rdf/1.0

Dereferencing and authority. We first give the function http : U → 2^𝒢 that maps a URI to the RDF graph it returns upon a HTTP lookup that returns the response code 200 Okay; in the case of failure, the function maps to the empty RDF graph. Next, we define the function redir : U → U that follows a (finite) sequence of HTTP redirects given upon lookup of a URI, or that returns the URI itself in the absence of a redirect or in the case of failure; this function first strips the fragment identifier (given after the '#' character) from the original URI, if present. Then, we give the dereferencing function deref := http ∘ redir as the mapping from a URI (a Web location) to an RDF graph it may provide by means of a given HTTP lookup that follows redirects. Finally, we can define the authority function, which maps the URI of a Web document to the set of RDF terms it is authoritative for, as follows:

  auth : U → 2^C
         u ↦ {c ∈ U | redir(c) = u} ∪ B(http(u))

where B(G) denotes the set of blank-nodes appearing in a graph G. In other words, a Web document is authoritative for URIs that dereference to it and for the blank nodes it contains.
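
The following sketch illustrates redir and auth under the simplifying assumption that HTTP redirects have already been resolved into a map (so no live lookups are performed); all names are illustrative.

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the authority function: a document at URI u is authoritative
// for URIs that (after stripping fragments and following redirects) resolve to u,
// plus the blank nodes appearing in the graph it serves.
final class Authority {
  private final Map<String, String> redirects;   // precomputed redirect targets per URI

  Authority(Map<String, String> redirects) { this.redirects = redirects; }

  // redir(c): strip the fragment, then follow the precomputed redirect if any.
  String redir(String uri) {
    int hash = uri.indexOf('#');
    String stripped = hash >= 0 ? uri.substring(0, hash) : uri;
    return redirects.getOrDefault(stripped, stripped);
  }

  // auth(u): URIs mentioned in the document that redirect back to u, plus its blank nodes.
  Set<String> auth(String documentUri, Set<String> mentionedUris, Set<String> blankNodes) {
    Set<String> authoritative = new HashSet<>(blankNodes);
    for (String c : mentionedUris) {
      if (redir(c).equals(documentUri)) {
        authoritative.add(c);
      }
    }
    return authoritative;
  }
}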

We note that making RDF vocabulary URIs dereferenceable is encouraged in various best-practices [45,4]. To enforce authoritative reasoning when applying rules with terminological and assertional patterns in the body, we require that terminological triples are served by a document authoritative for terms that are bound by specific variable positions in the rule.

Authoritative T-split rule application. Given a T-split rule and a rule application involving a mapping μ as before, for the rule application to be authoritative there must additionally exist a μ(v) such that v ∈ V(Ante^T) ∩ V(Ante^G), μ(v) ∈ auth(u) and μ(Ante^T) ⊆ http(u). We call the set of variables that appear in both the terminological and assertional segments of the rule body (i.e., V(Ante^T) ∩ V(Ante^G)) the set of authoritative variables for the rule. Where a rule does not have terminological or assertional patterns, we consider any rule-application as authoritative. The notation of an authoritative T-ground rule follows likewise.

Example. Let R_prp-ifp denote the same rule as used in the previous example, and let the following two triples be given by the dereferenced document of v1:isbn, but not of v2:mbox, i.e., a document authoritative for the former term but not the latter:

(v1:isbn, a, owl:InverseFunctionalProperty)
(v2:mbox, a, owl:InverseFunctionalProperty)

The only authoritative variable appearing in both the terminological and assertional patterns of the body of R_prp-ifp is ?p, bound by v1:isbn and v2:mbox above. Since the document serving the two triples is authoritative for v1:isbn, the first triple is considered authoritative, and the following T-ground rule will be generated:

(?x1, v1:isbn, ?y)(?x2, v1:isbn, ?y) ⇒ (?x1, owl:sameAs, ?x2)

However, since the document is not authoritative for v2:mbox, the second T-ground rule will be considered non-authoritative and discarded:

(?x1, v2:mbox, ?y)(?x2, v2:mbox, ?y) ⇒ (?x1, owl:sameAs, ?x2)

Where third-party documents re-define remote terms in a way that affects inferencing over instance data, we thus filter such non-authoritative definitions.


2.6. OWL 2 RL/RDF rules

Inference rules can be used to (partially) support the semantics of RDFS and OWL. Along these lines, the OWL 2 RL/RDF [24] ruleset is a partial axiomatisation of the OWL 2 RDF-Based Semantics, which is applicable to arbitrary RDF graphs and constitutes an extension of the RDF Semantics [28]; in other words, the ruleset supports a standardised profile of reasoning that partially covers the complete OWL semantics. Interestingly for our scenario, this profile includes inference rules that support the semantics and inference of owl:sameAs, as discussed.

First, in Table 1, we provide the set of OWL 2 RL/RDF rules that support the (positive) semantics of owl:sameAs, axiomatising the symmetry (rule eq-sym) and transitivity (rule eq-trans) of the relation. To take an example, rule eq-sym is as follows:

(?x, owl:sameAs, ?y) ⇒ (?y, owl:sameAs, ?x)

where, if we know that (a, owl:sameAs, b), the rule will infer the symmetric relation (b, owl:sameAs, a). Rule eq-trans operates analogously; both together allow for computing the transitive, symmetric closure of the equivalence relation. Furthermore, Table 1 also contains the OWL 2 RL/RDF rules that support the semantics of replacement (rules eq-rep-*), whereby data that holds for one entity must also hold for equivalent entities.

Note that we (optionally, and in the case of later evaluation) choose not to support:

(i) the reflexive semantics of owl:sameAs, since reflexive owl:sameAs statements will not lead to any consolidation or other non-reflexive equality relations, and will produce a large bulk of materialisations;
(ii) equality on predicates or on values for rdf:type, where we do not want possibly imprecise owl:sameAs data to affect terms in these positions.

Given that the semantics of equality is quadratic with respect to the A-Box, we apply a partial-materialisation approach that gives our notion of consolidation: instead of materialising (i.e., explicitly writing down) all possible inferences given by the semantics of replacement, we instead choose one canonical identifier to represent the set of equivalent terms, thus effectively compressing the data. We have used this approach in previous works [30–32], and it has also appeared in related works in the literature [70,6] as a common-sense optimisation for handling data-level equality. To take an example, in later evaluation (cf. Table 10) we find a valid equivalence class (set of equivalent entities) with 33,052 members; materialising all pairwise, non-reflexive owl:sameAs statements would infer more than 1 billion owl:sameAs relations (33,052² − 33,052 = 1,092,401,652); further assuming that each entity appeared in, on average, e.g., two quadruples, we would infer an additional ~2 billion statements of massively duplicated data. By choosing a single canonical identifier, we would instead only materialise 100 thousand statements.

Note that although we only perform partial materialisation, we do not change the semantics of equality: our methods are sound with respect to OWL semantics.

Table 1
Rules that support the positive semantics of owl:sameAs.

OWL2RL     Antecedent (assertional)                    Consequent
eq-sym     ?x owl:sameAs ?y                            ?y owl:sameAs ?x
eq-trans   ?x owl:sameAs ?y . ?y owl:sameAs ?z         ?x owl:sameAs ?z
eq-rep-s   ?s owl:sameAs ?s' . ?s ?p ?o                ?s' ?p ?o
eq-rep-o   ?o owl:sameAs ?o' . ?s ?p ?o                ?s ?p ?o'

In addition, alongside the partially materialised data, we provide a set of consolidated owl:sameAs relations (containing all of the identifiers in each equivalence class) that can be used to "backward-chain" the full inferences possible through replacement (as required). Thus, we do not consider the canonical identifier as somehow 'definitive' or superseding the other identifiers, but merely consider it as representing the equivalence class.7

Finally, herein we do not consider consolidation of literals: one may consider useful applications, e.g., for canonicalising datatype literals, but such discussion is out of the current scope.

As previously mentioned, OWL 2 RL/RDF also contains rules that use terminological knowledge (alongside assertional knowledge) to directly infer owl:sameAs relations. We enumerate these rules in Table 2; note that we italicise the labels of rules supporting features new to OWL 2, and that we list authoritative variables in bold.8

Applying only these OWL 2 RL/RDF rules may miss the inference of some owl:sameAs statements. For example, consider the example RDF data given in Turtle syntax [3] as follows:

# From the FOAF Vocabulary:
foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf .
foaf:isPrimaryTopicOf owl:inverseOf foaf:primaryTopic .
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

# From Example Document A:
exA:axel foaf:homepage <http://polleres.net> .

# From Example Document B:
<http://polleres.net> foaf:primaryTopic exB:apolleres .

# Inferred through prp-spo1:
exA:axel foaf:isPrimaryTopicOf <http://polleres.net> .

# Inferred through prp-inv:
exB:apolleres foaf:isPrimaryTopicOf <http://polleres.net> .

# Subsequently inferred through prp-ifp:
exA:axel owl:sameAs exB:apolleres .
exB:apolleres owl:sameAs exA:axel .

The example uses properties from the prominent FOAF vocabulary, which is used for publishing personal profiles as RDF. Here we additionally need the OWL 2 RL/RDF rules prp-inv and prp-spo1 (handling standard owl:inverseOf and rdfs:subPropertyOf inferencing, respectively) to infer the owl:sameAs relation entailed by the data.

Along these lines, we support an extended set of OWL 2 RL/RDF rules that contain precisely one assertional pattern, and for which we have demonstrated a scalable implementation, called SAOR, designed for Linked Data reasoning [33]. These rules are listed in Table 3 and, as per the previous example, can generate additional inferences that indirectly lead to the derivation of new owl:sameAs data. We will use these rules for entity consolidation in Section 5.

Lastly, as we discuss later (particularly in Section 7), Linked Data is inherently noisy and prone to mistakes. Although our methods are sound (i.e., correct) with respect to formal OWL semantics, such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair.

7 We may optionally consider non-canonical blank-node identifiers as redundant and discard them.

8 As discussed later, our current implementation requires at least one variable to appear in all assertional patterns in the body, and so we do not support prp-key and cls-maxqc3; however, these two features are not commonly used in Linked Data vocabularies [29].

Table 2
OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs.

OWL2RL       Antecedent                                                                     Consequent
             Terminological                           Assertional
prp-fp       ?p a owl:FunctionalProperty              ?x ?p ?y1 . ?x ?p ?y2                 ?y1 owl:sameAs ?y2
prp-ifp      ?p a owl:InverseFunctionalProperty       ?x1 ?p ?y . ?x2 ?p ?y                 ?x1 owl:sameAs ?x2
cls-maxc2    ?x owl:maxCardinality 1 ;                ?u a ?x . ?u ?p ?y1 . ?u ?p ?y2       ?y1 owl:sameAs ?y2
             ?x owl:onProperty ?p
cls-maxqc4   ?x owl:maxQualifiedCardinality 1 ;       ?u a ?x . ?u ?p ?y1 . ?u ?p ?y2       ?y1 owl:sameAs ?y2
             ?x owl:onProperty ?p ;
             ?x owl:onClass owl:Thing

Table 3
OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]).

OWL2RL      Antecedent                                             Consequent
            Terminological                       Assertional
prp-dom     ?p rdfs:domain ?c                    ?x ?p ?y          ?x a ?c
prp-rng     ?p rdfs:range ?c                     ?x ?p ?y          ?y a ?c
prp-symp    ?p a owl:SymmetricProperty           ?x ?p ?y          ?y ?p ?x
prp-spo1    ?p1 rdfs:subPropertyOf ?p2           ?x ?p1 ?y         ?x ?p2 ?y
prp-eqp1    ?p1 owl:equivalentProperty ?p2       ?x ?p1 ?y         ?x ?p2 ?y
prp-eqp2    ?p1 owl:equivalentProperty ?p2       ?x ?p2 ?y         ?x ?p1 ?y
prp-inv1    ?p1 owl:inverseOf ?p2                ?x ?p1 ?y         ?y ?p2 ?x
prp-inv2    ?p1 owl:inverseOf ?p2                ?x ?p2 ?y         ?y ?p1 ?x
cls-int2    ?c owl:intersectionOf (?c1 ... ?cn)  ?x a ?c           ?x a ?c1 ... ?x a ?cn
cls-uni     ?c owl:unionOf (?c1 ... ?ci ... ?cn) ?x a ?ci          ?x a ?c
cls-svf2    ?x owl:someValuesFrom owl:Thing ;    ?u ?p ?v          ?u a ?x
            ?x owl:onProperty ?p
cls-hv1     ?x owl:hasValue ?y ;                 ?u a ?x           ?u ?p ?y
            ?x owl:onProperty ?p
cls-hv2     ?x owl:hasValue ?y ;                 ?u ?p ?y          ?u a ?x
            ?x owl:onProperty ?p
cax-sco     ?c1 rdfs:subClassOf ?c2              ?x a ?c1          ?x a ?c2
cax-eqc1    ?c1 owl:equivalentClass ?c2          ?x a ?c1          ?x a ?c2
cax-eqc2    ?c1 owl:equivalentClass ?c2          ?x a ?c2          ?x a ?c1

Table 4
OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold.

OWL2RL       Antecedent
             Terminological                             Assertional
eq-diff1     –                                          ?x owl:sameAs ?y . ?x owl:differentFrom ?y
prp-irp      ?p a owl:IrreflexiveProperty               ?x ?p ?x
prp-asyp     ?p a owl:AsymmetricProperty                ?x ?p ?y . ?y ?p ?x
prp-pdw      ?p1 owl:propertyDisjointWith ?p2           ?x ?p1 ?y . ?x ?p2 ?y
prp-adp      ?x a owl:AllDisjointProperties ;           ?u ?pi ?y . ?u ?pj ?y (i ≠ j)
             ?x owl:members (?p1 ... ?pn)
cls-com      ?c1 owl:complementOf ?c2                   ?x a ?c1 . ?x a ?c2
cls-maxc1    ?x owl:maxCardinality 0 ;                  ?u a ?x . ?u ?p ?y
             ?x owl:onProperty ?p
cls-maxqc2   ?x owl:maxQualifiedCardinality 0 ;         ?u a ?x . ?u ?p ?y
             ?x owl:onProperty ?p ;
             ?x owl:onClass owl:Thing
cax-dw       ?c1 owl:disjointWith ?c2                   ?x a ?c1 . ?x a ?c2
cax-adc      ?x a owl:AllDisjointClasses ;              ?z a ?ci . ?z a ?cj (i ≠ j)
             ?x owl:members (?c1 ... ?cn)


Relatedly, OWL contains features that allow for detecting formal contradictions, called inconsistencies, in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization: formally, the intersection of such disjoint classes should be empty, and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later in Section 7 to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations:

– Scatter: partition on-disk data using some local split function, and send each chunk to an individual slave machine for subsequent processing;
– Run: request the parallel execution of a task by the slave machines; such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis;
– Gather: gather chunks of output data from the slave swarm and perform some local merge function over the data;
– Flood: broadcast global knowledge required by all slave machines for a future task.


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine is partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel, using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid channelling all intermediary data through one machine.
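
For orientation, the operations above might be summarised by interfaces along the following lines; this is an illustrative sketch only, and does not reflect the actual RMI-based signatures of our framework.

import java.util.List;

// Hypothetical summary of the distributed operations; the real system is built on
// Java RMI, but the signatures here are purely illustrative.
interface Master<T> {
  // scatter: split local data with a split function and send one chunk to each slave.
  void scatter(List<T> data, SplitFunction<T> split, List<Slave<T>> slaves);
  // run: ask all slaves to execute a named task in parallel over their local data.
  void run(String task, List<Slave<T>> slaves);
  // gather: collect output chunks from the slaves and merge them locally.
  List<T> gather(List<Slave<T>> slaves, MergeFunction<T> merge);
  // flood: broadcast globally required knowledge (e.g., an owl:sameAs index) to all slaves.
  void flood(List<T> globalKnowledge, List<Slave<T>> slaves);
}

interface Slave<T> {
  // coordinate: repartition local data amongst the slaves directly (split/send/gather
  // in parallel), bypassing the master so it does not become a bottleneck.
  void coordinate(SplitFunction<T> split, MergeFunction<T> merge, List<Slave<T>> peers);
}

interface SplitFunction<T> { int targetSlave(T item, int numSlaves); }
interface MergeFunction<T> { List<T> merge(List<List<T>> chunks); }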

Note that, herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications; that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle waiting for the slaves and, more critically, of the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel; although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which it should be performed, and the task-scheduling possible. The distributed implementation of our tasks is designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid, insofar as reasonable, duplicating computation.

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

9 http://openhms.sourceforge.net/rmiio/

10 We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications, viz.: 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka. PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count per each PLD queue were polled first.

With regards the resulting corpus: of the 1.118 billion quads, 1.106 billion are unique, and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now we discuss the usage of terms in a data-level position, viz., terms in the subject position or the object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs, and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred on average 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and the object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position; 19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs; 56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident; again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;
– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;
– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position).

Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position).

– each URI is re-used, on average, in 4.7 documents, but usually only within the same domain; most external re-use is in the object position of a triple;
– 99% of URIs appear in only one PLD.

We can conclude that, within our corpus (itself a general crawl for RDF/XML on the Web), there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource." [7]

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus (provided directly by the Linked Data publishers themselves) that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers, where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps identifiers to the equivalence class in which they are contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier for each equivalence class; we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;
(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;
(iii) for each equivalence class in the index, choose a canonical term;
(iv) scan the corpus again, canonicalising any term in the subject position or in the object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus (the owl:sameAs statements) and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;
(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);
(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence classes.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing the term with the lowest alphabetical ordering; a canonical term is thus associated with each equivalence class. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
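
For illustration, the canonical-term policy just described (URIs preferred over blank-nodes, ties broken by the lowest lexical ordering) can be captured by a comparator along the following lines; this is a sketch only, and the Term type is an assumption rather than our actual term representation.

import java.util.Collection;
import java.util.Comparator;

// Hypothetical sketch of canonical-term selection: prefer URIs over blank nodes,
// then take the lexically lowest term within each equivalence class.
final class CanonicalTerm {
  record Term(String value, boolean isBlankNode) { }

  static final Comparator<Term> CANONICAL_ORDER =
      Comparator.comparing(Term::isBlankNode)   // false (URI) sorts before true (blank node)
                .thenComparing(Term::value);    // then lowest alphabetical ordering

  static Term choose(Collection<Term> equivalenceClass) {
    return equivalenceClass.stream().min(CANONICAL_ORDER).orElseThrow();
  }
}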

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations;
(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index;
(iii) Flood/run: send the equality index (in its entirety) to each slave machine, and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods (involving the two scans of the main corpus) can be conducted in parallel.

Algorithm 1. Building equivalence map [32]

Require: SAMEAS DATA: SA
1:  map ← {}
2:  for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
3:    eqc_s ← map.get(s)
4:    eqc_o ← map.get(o)
5:    if eqc_s = ∅ ∧ eqc_o = ∅ then
6:      eqc_s∪o ← {s, o}
7:      map.put(s, eqc_s∪o)
8:      map.put(o, eqc_s∪o)
9:    else if eqc_s = ∅ then
10:     add s to eqc_o
11:     map.put(s, eqc_o)
12:   else if eqc_o = ∅ then
13:     add o to eqc_s
14:     map.put(o, eqc_s)
15:   else if eqc_s ≠ eqc_o then
16:     if |eqc_s| > |eqc_o| then
17:       add all eqc_o into eqc_s
18:       for e_o ∈ eqc_o do
19:         map.put(e_o, eqc_s)
20:       end for
21:     else
22:       add all eqc_s into eqc_o
23:       for e_s ∈ eqc_s do
24:         map.put(e_s, eqc_o)
25:       end for
26:     end if
27:   end if
28: end for

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%), i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1, 2, 4, 8) over the 100 million quadruple corpus.

Table 5
Breakdown of timing of distributed baseline consolidation.

Category                        1 slave        2 slaves       4 slaves       8 slaves       Full-8
                                min    %       min    %       min    %       min    %       min    %
Master
 Total execution time           42.2   100.0   22.3   100.0   11.9   100.0   6.7    100.0   63.3   100.0
 Executing                      0.5    1.1     0.5    2.2     0.5    4.0     0.5    7.5     8.5    13.4
  Aggregating owl:sameAs        0.4    1.1     0.5    2.2     0.5    4.0     0.5    7.5     8.4    13.3
  Miscellaneous                 0.1    0.1     0.1    ~       ~      ~       ~      0.4     0.1    0.1
 Idle (waiting for slaves)      41.8   98.9    21.8   97.7    11.4   95.8    6.2    92.1    54.8   86.6
Slaves
 Avg. executing (total)         41.8   98.9    21.8   97.7    11.4   95.8    6.2    92.1    53.5   84.6
  Extract owl:sameAs            9.0    21.4    4.5    20.2    2.2    18.4    1.1    16.8    12.3   19.5
  Consolidate                   32.7   77.5    17.1   76.9    8.8    73.5    4.7    70.5    41.2   65.1
 Avg. idle                      0.5    1.1     0.7    2.9     1.0    8.2     0.8    12.7    9.8    15.4
  Waiting for peers             0.0    0.0     0.1    0.7     0.5    4.0     0.3    4.7     1.3    2.0
  Waiting for master            0.5    1.1     0.5    2.3     0.5    4.2     0.5    7.9     8.5    13.4

Fig. 3. Distribution of sizes of equivalence classes, on log/log scale (from [32]).

Fig. 4. Distribution of the number of PLDs per equivalence class, on log/log scale.

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-2877d02973d0d01e8f29c7113776e7e which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


Note that, in the table, we use the symbol ~ to indicate a negligible value (0 < ~ < 0.05). We see that, as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5× would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs), an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not they were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp. Dieter Fensel and Rudi Studer: these authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results.

Table 6
Largest five equivalence classes (from [32]).

#   Canonical term (lexically first in equivalence class)                             Size   Correct
1   http://bio2rdf.org/dailymed_drugs:1000                                            8481
2   http://bio2rdf.org/dailymed_drugs:1042                                            800
3   http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e       443    ✓
4   http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1             353    ✓
5   http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9  316    ✓


For the sampling, we take the closed equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folk-core and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

#    PLD             PLD                      Co-occur
1    rdfize.com      uriburner.com            506,623
2    dbpedia.org     freebase.com             187,054
3    bio2rdf.org     purl.org                 185,392
4    loc.gov         info:lc/authorities (a)  166,650
5    l3s.de          rkbexplorer.com          125,842
6    bibsonomy.org   l3s.de                   125,832
7    bibsonomy.org   rkbexplorer.com          125,830
8    dbpedia.org     mpii.de                  99,827
9    freebase.com    mpii.de                  89,610
10   dbpedia.org     umbel.org                65,565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities: in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features (including functional properties, inverse-functional properties and certain cardinality restrictions) that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes, such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc., to bootstrap identity on the Web of Data within certain domains: such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, WIKIPEDIA articles, etc.

Although Linked Data literature has not explicitly endorsed orencouraged such usage prominent grass-roots efforts publishingRDF on the Web rely (or have relied) on inverse-functional proper-ties for maintaining consistent identity For example in the FriendOf A Friend (FOAF) community a technique called smushing wasproposed to leverage such properties for identity serving as anearly precursor to methods described herein13

Versus the baseline consolidation approach we must first ap-ply some reasoning techniques to infer novel owlsameAs rela-tions Once we have the inferred owlsameAs data materialisedand merged with the asserted owlsameAs relations we can thenapply the same consolidation approach as per the baseline (i) ap-ply the transitive symmetric closure and generate the equivalenceclasses (ii) pick canonical identifiers for each class and (iii) re-write the corpus However in contrast to the baseline we expectthe extended volume of owlsameAs data to make in-memorystorage infeasible14 Thus in this section we avoid the need tostore equivalence information in-memory where we instead inves-tigate batch-processing methods for applying an on-disk closure ofthe equivalence relations and likewise on-disk methods for rewrit-ing data

Early versions of the local batch-processing [31] techniques andsome of the distributed reasoning techniques [33] are borrowedfrom previous works Herein we reformulate and combine thesetechniques into a complete solution for applying enhanced consol-idation in a distributed scalable manner over large-scale Linked

13 cf httpwikifoaf-projectorgwSmushing retr 2011012214 Further note that since all machines must have all owlsameAs information

adding more machines does not increase the capacity for handling more suchrelations the scalability of the baseline approach is limited by the machine with theleast in-memory capacity

88 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

Data corpora We provide new evaluation over our 1118 billionquadruple evaluation corpus presenting new performance resultsover the full corpus and for a varying number of slave machinesWe also analyse in depth the results of applying the techniquesover our evaluation corpus analysing the applicability of ourmethods in such scenarios We contrast the results given by thebaseline and extended consolidation approaches and again manu-ally evaluate the results of the extended consolidation approach for1000 pairs found to be coreferent

51 High-level approach

In Table 2 we provide the pertinent rules for inferring new owl

sameAs relations from the data However after analysis of thedata we observed that no documents used the owlmaxQuali-

fiedCardinality construct required for the cls-maxqc rulesand that only one document defined one owlhasKey axiom15

involving properties with less than five occurrences in the datamdashhence we leave implementation of these rules for future work andnote that these new OWL 2 constructs have probably not yet had timeto find proper traction on the Web Thus on top of inferencing involv-ing explicit owlsameAs we are left with rule prp-fp which supportsthe semantics of properties typed owlFunctionalProperty andrule prp-ifp which supports the semantics of properties typedowlInverseFunctionalProperty and rule cls-maxc1 whichsupports the semantics of classes with a specified cardinality of 1for some defined property (a class restricted version of the func-tional-property inferencing)16

Thus we look at using OWL 2 RLRDF rules prp-fp prp-ifp andcls-maxc2 for inferring new owl sameAs relations between indi-vidualsmdashwe also support an additional rule that gives an exact car-dinality version of cls-maxc217 Herein we refer to these rules asconsolidation rules

We also investigate pre-applying more general OWL 2 RLRDFreasoning over the corpus to derive more complete results wherethe ruleset is available in Table 3 and is restricted to OWL 2 RLRDFrules with one assertional pattern [33] unlike the consolidationrules these general rules do not require the computation of asser-tional joins and thus are amenable to execution by means of a sin-gle-scan of the corpus We have demonstrated this profile ofreasoning to have good competency with respect to the features ofRDFS and OWL used in Linked Data [29] These general rules mayindirectly lead to the inference of additional owlsameAs relationsparticularly when combined with the consolidation rules of Table 2

Once we have used reasoning to derive novel owlsameAs

data we then apply an on-disk batch processing technique toclose the equivalence relation and to ultimately consolidate thecorpus

Thus our high-level approach is as follows

(i) extract relevant terminological data from the corpus(ii) bind the terminological patterns in the rules from this data

thus creating a larger set of general rules with only oneassertional pattern and identifying assertional patterns thatare useful for consolidation

(iii) apply general-rules over the corpus and buffer any inputinferred statements relevant for consolidation to a new file

(iv) derive the closure of owlsameAs statements from the con-solidation-relevant dataset

15 httphuemerlstadlernetrolerhrdf16 Note that we have presented non-distributed batch-processing execution of

these rules at a smaller scale (147 million quadruples) in [31]17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal

proposition of completeness underlying the profile but such considerations are mootin our scenario

(v) apply consolidation over the main corpus with respect to theclosed owlsameAs data

In Step (i) we extract terminological datamdashrequired for applica-tion of our rulesmdashfrom the main corpus As discussed in Section 25to help ensure robustness we apply authoritative reasoning inother words at this stage we discard any third-party terminologi-cal statements that affect inferencing over instance data for remoteclasses and properties

In Step (ii) we then compile the authoritative terminologicalstatements into the rules to generate a set of T-ground (purelyassertional) rules which can then be applied over the entirety ofthe corpus [33] As mentioned in Section 26 the T-ground generalrules only contain one assertional pattern and thus are amenableto execution by a single scan eg given the statement

homepage rdfssubPropertyOf isPrimaryTopicOf

we would generate a T-ground rulex homepage y

)x isPrimaryTopicOf y

which does not require the computation of joins We also bindthe terminological patterns of the consolidation rules in Table 2however these rules cannot be performed by means of a singlescan over the data since they contain multiple assertional pat-terns in the body Thus we instead extract triple patternssuch as

x isPrimaryTopicOf y

which may indicate data useful for consolidation (note that hereisPrimaryTopicOf is assumed to be inverse-functional)

In Step (iii) we then apply the T-ground general rules over thecorpus with a single scan Any input or inferred statements match-ing a consolidation-relevant patternmdashincluding any owlsameAs

statements foundmdashare buffered to a file for later join computationSubsequently in Step (iv) we must now compute the on-disk

canonicalised closure of the owlsameAs statements In particularwe mainly employ the following three on-disk primitives

(i) sequential scans of flat files containing line-delimitedtuples18

(ii) external-sorts where batches of statements are sorted inmemory the sorted batches written to disk and the sortedbatches merged to the final output and

(iii) merge-joins where multiple sets of data are sorted accordingto their required join position and subsequently scanned inan interleaving manner that aligns on the join position andwhere an in-memory join is applied for each individual joinelement

Using these primitives to compute the owlsameAs closureminimises the amount of main memory required where we havepresented similar approaches in [3031]

First assertional memberships of functional properties inverse-functional properties and cardinality restrictions (both propertiesand classes) are written to separate on-disk files For functional-property and cardinality reasoning a consistent join variable forthe assertional patterns is given by the subject position for in-verse-functional-property reasoning a join variable is given bythe object position19 Thus we can sort the former sets of data

8 These files are G-Zip compressed flat files of N-Triple-like syntax encodingrbitrary length tuples of RDF constants9 Although a predicate-position join is also available we prefer object-positionins that provide smaller batches of data for the in-memory join

1

a1

jo

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 89

according to subject and perform a merge-join by means of a linearscan thereafter the same procedure applies to the latter set sortingand merge-joining on the object position Applying merge-join scanswe produce new owlsameAs statements

Both the originally asserted and newly inferred owlsameAs

relations are similarly written to an on-disk file over which wenow wish to perform the canonicalised symmetrictransitive clo-sure We apply a similar method again leveraging external sortsand merge-joins to perform the computation (herein we sketchand point the interested reader to [31 Section 46]) In the follow-ing we use gt and lt to denote lexical ordering SA as a shortcut forowlsameAs a b c etc to denote members of U [ B such thata lt b lt c and define URIs to be lexically lower than blank nodesThe process is as follows

(i) we only materialise symmetric equality relations thatinvolve a (possibly intermediary) canonical term chosenby a lexical ordering given b SA a we materialise a SA bgiven a SA b SA c we materialise the relations a SA b aSA c and their inverses but do not materialise b SA c orits inverse

(ii) transitivity is supported by iterative merge-join scans

20 One could instead build an on-disk map for equivalence classes and pivoelements and follow a consolidation method similar to the previous section over theunordered data however we would expect such an on-disk index to have a low cachehit-rate given the nature of the data which would lead to many (slow) disk seeks Analternative approach might be to hash-partition the corpus and equality index overdifferent machines however this would require a non-trivial minimum amount omemory on the cluster

ndash in the scan if we find c SA a (sorted by object) and c SA d(sorted naturally) we infer a SA d and drop the non-canonical c SA d (and d SA c)

ndash at the end of the scan newly inferred triples are marked and merge-joined

into the main equality data any triples echoing an earlier inference are ignored dropped non-canonical statements are removed

ndash the process is then iterative in the next scan if we find dSA a and d SA e we infer a SA e and e SA a

ndash inferences will only occur if they involve a statementadded in the previous iteration ensuring that inferencesteps are not re-computed and that the computation willterminate

(iii) the above iterations stop when a fixpoint is reached andnothing new is inferred

(iv) the process of reaching a fixpoint is accelerated using avail-able main-memory to store a cache of partial equalitychains

The above steps follow well-known methods for transtive clo-sure computation modified to support the symmetry ofowlsameAs and to use adaptive canonicalisation in order toavoid quadratic output With respect to the last item we useAlgorithm 1 to derive lsquolsquobatchesrsquorsquo of in-memory equivalencesand when in-memory capacity is achieved we write thesebatches to disk and proceed with on-disk computation this isparticularly useful for computing the small number of longequality chains which would otherwise require sorts andmerge-joins over all of the canonical owlsameAs data currentlyderived and where the number of iterations would otherwise bethe length of the longest chain The result of this process is a setof canonicalised equality relations representing the symmetrictransitive closure

Next we briefly describe the process of canonicalising data withrespect to this on-disk equality closure where we again use exter-nal-sorts and merge-joins First we prune the owlsameAs indexto only maintain relations s1 SA s2 such that s1 gt s2 thus given s1

SA s2 we know that s2 is the canonical identifier and s1 is to berewritten We then sort the data according to the position thatwe wish to rewrite and perform a merge-join over both sets ofdata buffering the canonicalised data to an output file If we wantto rewrite multiple positions of a file of tuples (eg subject and

object) we must rewrite one position sort the results by the sec-ond position and then rewrite the second position20

Finally note that in the derivation of owlsameAs from the con-solidation rules prp-fp prp-ifp cax-maxc2 the overall processmay be iterative For example consider

dblpAxel foafisPrimaryTopicOf lthttppolleresnetgt

axelme foafisPrimaryTopicOf lthttpaxelderiiegt

lthttppolleresnetgt owlsameAs lthttpaxelderiiegt

from which the conclusion that dblpAxel is the same asaxelme holds We see that new owlsameAs relations (either as-serted or derived from the consolidation rules) may in turn lsquolsquoalignrsquorsquoterms in the join position of the consolidation rules leading to newequivalences Thus for deriving the final owlsameAs we require ahigher-level iterative process as follows

(i) initially apply the consolidation rules and append theresults to a file alongside the asserted owl sameAs state-ments found

(ii) apply the initial closure of the owlsameAs data(iii) then iteratively until no new owlsameAs inferences are

found

ndash rewrite the join positions of the on-disk files containing

the data for each consolidation rule according to the cur-rent owlsameAs data

ndash derive new owlsameAs inferences possible through theprevious rewriting for each consolidation rule

ndash re-derive the closure of the owlsameAs data includingthe new inferences

The final closed file of owlsameAs data can then be re-used torewrite the main corpus in two sorts and merge-join scans oversubject and object

52 Distributed approach

The distributed approach follows quite naturally from theprevious discussion As before we assume that the input dataare evenly pre-distributed over the slave machines (in anyarbitrary ordering) where we can then apply the followingprocess

(i) Run scan the distributed corpus (split over the slavemachines) in parallel to extract relevant terminologicalknowledge

(ii) Gather gather terminological data onto the master machineand thereafter bind the terminological patterns of the gen-eralconsolidation rules

(iii) Flood flood the rules for reasoning and the consolidation-relevant patterns to all slave machines

(iv) Run apply reasoning and extract consolidation-relevantstatements from the input and inferred data

(v) Gather gather all consolidation statements onto the mastermachine then in parallel

t

f

Table 8Breakdown of timing of distributed extended consolidation wreasoning where identifies the masterslave tasks that run concurrently

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 4543 1000 2302 1000 1275 1000 813 1000 7404 1000Executing 299 66 303 131 325 255 301 370 2436 329

Aggregate Consolidation Relevant Data 43 09 45 20 47 37 46 57 299 40Closing owlsameAs

244 54 245 106 254 199 244 300 2112 285Miscellaneous 12 03 13 05 24 19 11 13 25 03

Idle (waiting for slaves) 4245 934 1999 869 950 745 512 630 4968 671

SlaveAvg Executing (total) 4489 988 2212 961 1116 875 565 694 5991 809

Extract Terminology 444 98 227 99 118 92 69 84 494 67Extract Consolidation Relevant Data 955 210 483 210 243 191 132 162 1379 186Initial Sort (by subject) 980 216 483 210 237 186 116 142 1338 181Consolidation 2110 464 1019 443 519 407 249 306 2780 375

Avg Idle 55 12 89 39 379 297 236 290 1413 191Waiting for peers 00 00 32 14 71 56 63 77 375 51Waiting for master 55 12 58 25 308 241 173 212 1038 140

90 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

ndash Local compute the closure of the consolidation rules andthe owlsameAs data on the master machine

ndash Run each slave machine sorts its fragment of the maincorpus according to natural order (spoc)

1 At least in terms of pure quantity However we do not give an indication of theuality or importance of those few equivalences we miss with this approximationhich may well be application specific

(vi) Flood send the closed owlsameAs data to the slavemachines once the distributed sort has been completed

(vii) Run each slave machine then rewrites the subjects of theirsegment of the corpus subsequently sorts the rewritten databy object and then rewrites the objects (of non-rdftypetriples) with respect to the closed owlsameAs data

53 Performance evaluation

Applying the above process to our 1118 billion quadruple corpustook 1234 h Extracting the terminological data took 114 h with anaverage idle time of 19 min (277) (one machine took 18 minlonger than the rest due to processing a large ontology containing2 million quadruples [32]) Merging and aggregating the termino-logical data took roughly 1 min Applying the reasoning andextracting the consolidation relevant statements took 234 h withan average idle time of 25 min (18) Aggregating and mergingthe consolidation relevant statements took 299 min Thereafterlocally computing the closure of the consolidation rules and theequality data took 352 h with the computation requiring two itera-tions overall (the minimum possiblemdashthe second iteration did notproduce any new results) Concurrent to the previous step the paral-lel sort of remote data by natural order took 233 h with an averageidle time of 6 min (43) Subsequent parallel consolidation of thedata took 48 h with 10 min (35) average idle timemdashof this 19of the time was spent consolidating the pre-sorted subjects 60of the time was spent sorting the rewritten data by object and21 of the time was spent consolidating the objects of the data

As before Table 8 summarises the timing of the task Focusing onthe performance for the entire corpus the master machine took406 h to coordinate global knowledge constituting the lower boundon time possible for the task to execute with respect to increasingmachines in our setupmdashin future it may be worthwhile to investigatedistributed strategies for computing the owlsameAs closure(which takes 285 of the total computation time) but for the mo-ment we mitigate the cost by concurrently running a sort on theslave machines thus keeping the slaves busy for 634 of the timetaken for this local aggregation step The slave machines were onaverage busy for 809 of the total task time of the idle time733 was spent waiting for the master machine to aggregate theconsolidation relevant data and to finish the closure of owl sameAs

data and the balance (267) was spent waiting for peers to finish(mostly during the extraction of terminological data)

With regards performance evaluation over the 100 million qua-druple corpus for a varying number of slave machines perhapsmost notably the average idle time of the slave machines increasesquite rapidly as the number of machines increases In particular by4 machines the slaves can perform the distributed sort-by-subjecttask faster than the master machine can perform the local owlsameAs closure The significant workload of the master machinewill eventually become a bottleneck as the number of slave ma-chines increases further This is also noticeable in terms of overalltask time the respective time savings as the number of slave ma-chines doubles (moving from 1 slave machine to 8) are 0507 0554 and 0638 respectively

Briefly we also ran the consolidation without the general rea-soning rules (Table 3) motivated earlier With respect to perfor-mance the main variations were given by (i) the extraction ofconsolidation relevant statementsmdashthis time directly extractedfrom explicit statements as opposed to explicit and inferred state-mentsmdashwhich took 154 min (11 of the time taken including thegeneral reasoning) with an average idle time of less than one min-ute (6 average idle time) (ii) local aggregation of the consolida-tion relevant statements took 17 min (569 of the time takenpreviously) (iii) local closure of the owlsameAs data took318 h (904 of the time taken previously) The total time savedequated to 28h (227) where 333 min were saved from coordi-nation on the master machine and 225 h were saved from parallelexecution on the slave machines

54 Results evaluation

In this section we present the results of the consolidationwhich included the general reasoning step in the extraction ofthe consolidation-relevant statements In fact we found that theonly major variation between the two approaches was in theamount of consolidation-relevant statements collected (discussedpresently) where the changes in the total amounts of coreferentidentifiers equivalence classes and canonicalised terms were infact negligible (lt01) Thus for our corpus extracting only as-serted consolidation-relevant statements offered a very closeapproximation of the extended reasoning approach21

2

qw

Table 9Top ten most frequently occurring blacklisted values

Blacklisted term Occurrences

1 Empty literals 5847352 lthttpnullgt 4140883 lthttpwwwvoxcomgonegt 1504024 lsquolsquo08445a31a78661b5c746feff39a9db6e4e2cc5cfrsquorsquo 583385 lthttpwwwfacebookcomgt 69886 lthttpfacebookcomgt 54627 lthttpwwwgooglecomgt 22348 lthttpwwwfacebookcomgt 12549 lthttpgooglecomgt 110810 lthttpnullcomgt 542

1

10

100

1000

10000

100000

1e+006

1e+007

1 10 100 1000 10000 100000

num

ber o

f cla

sses

equivalence class size

baseline consolidationreasoning consolidation

Fig 5 Distribution of the number of identifiers per equivalence classes for baselineconsolidation and extended reasoning consolidation (loglog)

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 91

Extracting the terminological data we found authoritative dec-larations of 434 functional properties 57 inverse-functional prop-erties and 109 cardinality restrictions with a value of 1 Wediscarded some third-party non-authoritative declarations of in-verse-functional or functional properties many of which were sim-ply lsquolsquoechoingrsquorsquo the authoritative semantics of those properties egmany people copy the FOAF vocabulary definitions into their localdocuments We also found some other documents on the bio2rd-forg domain that define a number of popular third-party proper-ties as being functional including dctitle dcidentifierfoafname etc22 However these particular axioms would only af-fect consolidation for literals thus we can say that if applying onlythe consolidation rules also considering non-authoritative defini-tions would not affect the owlsameAs inferences for our corpusAgain non-authoritative reasoning over general rules for a corpussuch as ours quickly becomes infeasible [9]

We (again) gathered 377 million owlsameAs triples as well as5293 million memberships of inverse-functional properties 1109million memberships of functional properties and 256 millioncardinality-relevant triples Of these respectively 2214 million(418) 117 million (106) and 533 thousand (208) were as-sertedmdashhowever in the resulting closed owlsameAs data derivedwith and without the extra reasoned triples we detected a varia-tion of less than 12 thousand terms (008) where only 129 wereURIs and where other variations in statistics were less than 01(eg there were 67 less equivalence classes when the reasoned tri-ples were included)

From previous experiments for older RDF Web data [30] wewere aware of certain values for inverse-functional propertiesand functional properties that are erroneously published byexporters and which cause massive incorrect consolidation Weagain blacklist statements featuring such values from our consoli-dation processing where we give the top 10 such values encoun-tered for our corpus in Table 9mdashthis blacklist is the result of trialand error manually inspecting large equivalence classes and themost common values for (inverse-)functional properties Emptyliterals are commonly exported (with and without language tags)as values for inverse-functional-properties (particularly FOAFlsquolsquochat-ID propertiesrsquorsquo) The literal 08445a31a78661b5c746fef-f39a9db6e4e2cc5cfis the SHA-1 hash of the string lsquomailto rsquocommonly assigned as a foafmbox_sha1sum value to userswho do not specify their email in some input form The remainingURIs are mainly user-specified values for foafhomepage or val-ues automatically assigned for users that do not specify such23

During the computation of the owlsameAs closure we foundzero inferences through cardinality rules 1068 thousand rawowl sameAs inferences through functional-property reasoningand 87 million raw owlsameAs inferences through inverse-func-

22 cf httpbio2rdforgnsbio2rdfTopic offline as of 2011062023 Our full blacklist contains forty-one such values and can be found at http

aidanhogancomswseblacklisttxt 24 This site shut down on 20100930

tional-property reasoning The final canonicalised closed and non-symmetric owlsameAs index (such that s1 SA s2 s1 gt s2 and s2 is acanonical identifier) contained 1203 million statements

From this data we generated 282 million equivalence classes(an increase of 131 from baseline consolidation) mentioning atotal of 1486 million terms (an increase of 258 from baselinemdash577 of all URIs and blank-nodes) of which 903 million wereblank-nodes (an increase of 2173 from baselinemdash546 of allblank-nodes) and 583 million were URIs (an increase of 1014from baselinemdash633 of all URIs) Thus we see a large expansionin the amount of blank-nodes consolidated but only minimalexpansion in the set of URIs referenced in the equivalence classesWith respect to the canonical identifiers 641 thousand (227)were blank-nodes and 218 million (773) were URIs

Fig 5 contrasts the equivalence class sizes for the baseline ap-proach (seen previously in Fig 3) and for the extended reasoningapproach Overall there is an observable increase in equivalenceclass sizes where we see the average equivalence class size growto 526 entities (198 baseline) the largest equivalence class sizegrow to 33052 (39 baseline) and the percentage of equivalenceclasses with the minimum size 2 drop to 631 (from 741 inbaseline)

In Table 10 we update the five largest equivalence classesResult 2 carries over from the baseline consolidation The rest ofthe results are largely intra-PLD equivalences where the entity isdescribed using thousands of blank-nodes with a consistent(inverse-)functional property value attached Result 1 refers to ameta-usermdashlabelled Team Voxmdashcommonly appearing in user-FOAFexports on the Vox blogging platform24 Result 3 refers to a personidentified using blank-nodes (and once by URI) in thousands of RDFdocuments resident on the same server Result 4 refers to the ImageBioinformatics Research Group in the University of OxfordmdashlabelledIBRGmdashwhere again it is identified in thousands of documents usingdifferent blank-nodes but a consistent foafhomepage Result 5 issimilar to result 1 but for a Japanese version of the Vox user

We performed an analogous manual inspection of one thousandpairs as per the sampling used previously in Section 44 for thebaseline consolidation results This time we selected SAME 823(823) TRIVIALLY SAME 145 (145) DIFFERENT 23 (23) and UNCLEAR

9 (09) This gives an observed precision for the sample of977 (vs 972 for baseline) After applying our blacklistingthe extended consolidation results are in fact slightly more pre-cise (on average) than the baseline equivalences this is due in part

Table 10Largest five equivalence classes after extended consolidation

Canonical term (lexically lowest in equivalence class) Size Correct

1 bnode37httpa12iggymomvoxcomprofilefoafrdf 33052 U

2 httpbio2rdforgdailymed_drugs1000 84813 httpajftorgrdffoafrdf_me 8140 U

4 bnode4http174129121408080tcmdataassociation100 4395 U

5 bnode1httpaaa977voxcomprofilefoafrdf 1977 U

1

10

100

1000

10000

100000

1e+006

1e+007

1 10 100

num

ber o

f cla

sses

number of PLDs in equivalence class

baseline consolidationreasoning consolidation

Fig 6 Distribution of the number of PLDs per equivalence class for baselineconsolidation and extended reasoning consolidation (loglog)

Table 11Number of equivalent identifiers found for the top-five ranked entities in SWSE withrespect to baseline consolidation (BL) and reasoning consolidation (R)

Canonical Identifier BL R

1 lthttpdblpl3sdeTim_Berners-Leegt 26 502 ltgeniddanbrigt 10 1383 lthttpupdatestatusnetgt 0 04 lthttpwwwldoddscomfoaffoaf-a-maticgt 0 05 lthttpupdatestatusnetuser1acctgt 0 6

1

10

100

1000

10000

100000

1e+006

1e+007

1e+008

1e+009

1 10 100

num

ber o

f ter

ms

number of PLDs mentioning blank-nodeURI term

baseline consolidatedraw data

reasoning consolidated

Fig 7 Distribution of number of PLDs the terms are referenced by for the rawbaseline consolidated and reasoning consolidated data (loglog)

92 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

to a high number of blank-nodes being consolidated correctly fromvery uniform exporters of FOAF data that lsquolsquodilutersquorsquo the incorrectresults25

Fig 6 presents a similar analysis to Fig 5 this time looking atidentifiers on a PLD-level granularity Interestingly the differencebetween the two approaches is not so pronounced initially indi-cating that many of the additional equivalences found throughthe consolidation rules are lsquolsquointra-PLDrsquorsquo In the baseline consolida-tion approach we determined that 57 of equivalence classes wereinter-PLD (contain identifiers from more than one PLD) with theplurality of equivalence classes containing identifiers from pre-cisely two PLDs (951thousand 441) this indicates that explicitowlsameAs relations are commonly asserted between PLDs Inthe extended consolidation approach (which of course subsumesthe above results) we determined that the percentage of inter-PLD equivalence classes dropped to 436 with the majority ofequivalence classes containing identifiers from only one PLD(159 million 564) The entity with the most diverse identifiers

25 We make these and later results available for review at httpswsederiorgentity

(the observable outlier on the x-axis in Fig 6) was the personlsquolsquoDan Brickleyrsquorsquomdashone of the founders and leading contributors ofthe FOAF projectmdashwith 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs various other prominent communitymembers and some country identifiers also featured high on thelist

In Table 11 we compare the consolidation of the top five rankedidentifiers in the SWSE system (see [32]) The results refer respec-tively to (i) the (co-)founder of the Web lsquolsquoTim Berners-Leersquorsquo (ii)lsquolsquoDan Brickleyrsquorsquo as aforementioned (iii) a meta-user for the mi-cro-blogging platform StatusNet which exports RDF (iv) thelsquolsquoFOAF-a-maticrsquorsquo FOAF profile generator (linked from many diversedomains hosting FOAF profiles it created) and (v) lsquolsquoEvan Prodro-moursquorsquo founder of the identicaStatusNet micro-blogging serviceand platform We see a significant increase in equivalent identifiersfound for these results however we also noted that after reason-ing consolidation Dan Brickley was conflated with a secondperson26

Note that the most frequently co-occurring PLDs in our equiva-lence classes remained unchanged from Table 7

During the rewrite of the main corpus terms in 15177 millionsubject positions (1358 of all subjects) and 3216 million objectpositions (353 of non-rdftype objects) were rewritten givinga total of 18393 million positions rewritten (18 the baselineconsolidation approach) In Fig 7 we compare the re-use of termsacross PLDs before consolidation after baseline consolidation andafter the extended reasoning consolidation Again although thereis an increase in re-use of identifiers across PLDs we note that

6 Domenico Gendarmi with three URIsmdashone document assigns one of Danrsquosoafmbox_sha1sum values (for danbriw3org) to Domenico httpfoafbuild-rqdoscompeoplemyriamleggieriwordpresscomfoafrdf

2

f

e

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 93

(i) the vast majority of identifiers (about 99) still only appear inone PLD (ii) the difference between the baseline and extended rea-soning approach is not so pronounced The most widely referencedconsolidated entitymdashin terms of unique PLDsmdashwas lsquolsquoEvan Prodro-moursquorsquo as aformentioned referenced with six equivalent URIs in101 distinct PLDs

In summary we conclude that applying the consolidation rulesdirectly (without more general reasoning) is currently a goodapproximation for Linked Data and that in comparison to the base-line consolidation over explicit owlsameAs (i) the additional con-solidation rules generate a large bulk of intra-PLD equivalences forblank-nodes27 (ii) relatedly there is only a minor expansion(1014) in the number of URIs involved in the consolidation (iii)with blacklisting the overall precision remains roughly stable

6 Statistical concurrence analysis

Having looked extensively at the subject of consolidating LinkedData entities in this section we now introduce methods for deriv-ing a weighted concurrence score between entities in the LinkedData corpus we define entity concurrence as the sharing of out-links inlinks and attribute values denoting a specific form of sim-ilarity Conceptually concurrence generates scores between twoentities based on their shared inlinks outlinks or attribute valuesMore lsquolsquodiscriminatingrsquorsquo shared characteristics lend a higher concur-rence score for example two entities based in the same village aremore concurrent than two entities based in the same country Howdiscriminating a certain characteristic is is determined throughstatistical selectivity analysis The more characteristics two entitiesshare (typically) the higher their concurrence score will be

We use these concurrence measures to materialise new linksbetween related entities thus increasing the interconnectednessof the underlying corpus We also leverage these concurrence mea-sures in Section 7 for disambiguating entities where after identi-fying that an equivalence class is likely erroneous and causinginconsistency we use concurrence scores to realign the most sim-ilar entities

In fact we initially investigated concurrence as a speculativemeans of increasing the recall of consolidation for Linked Datathe core premise of this investigation was that very similar enti-tiesmdashwith higher concurrence scoresmdashare likely to be coreferentAlong these lines the methods described herein are based on pre-liminary works [34] where we

ndash investigated domain-agnostic statistical methods for perform-ing consolidation and identifying equivalent entities

ndash formulated an initial small-scale (56 million triples) evaluationcorpus for the statistical consolidation using reasoning consoli-dation as a best-effort lsquolsquogold-standardrsquorsquo

Our evaluation gave mixed results where we found some corre-lation between the reasoning consolidation and the statisticalmethods but we also found that our methods gave incorrect re-sults at high degrees of confidence for entities that were clearlynot equivalent but shared many links and attribute values in com-mon This highlights a crucial fallacy in our speculative approachin almost all cases even the highest degree of similarityconcur-rence does not necessarily indicate equivalence or co-reference(Section 44 cf [27]) Similar philosophical issues arise with respectto handling transitivity for the weighted lsquolsquoequivalencesrsquorsquo derived[1639]

27 We believe this to be due to FOAF publishing practices whereby a given exporteruses consistent inverse-functional property values instead of URIs to uniquelyidentify entities across local documents

However deriving weighted concurrence measures has applica-tions other than approximative consolidation in particular we canmaterialise named relationships between entities that share a lotin common thus increasing the level of inter-linkage between enti-ties in the corpus Again as we will see later we can leverage theconcurrence metrics to repair equivalence classes found to be erro-neous during the disambiguation step of Section 7 Thus we pres-ent a modified version of the statistical analysis presented in [34]describe a (novel) scalable and distributed implementation thereofand finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus

61 High-level approach

Our statistical concurrence analysis inherits similar primaryrequirements to that imposed for consolidation the approachshould be scalable fully automatic and domain agnostic to beapplicable in our scenario Similarly with respect to secondary cri-teria the approach should be efficient to compute should givehigh precision and should give high recall Compared to consoli-dation high precision is not as critical for our statistical use-casein SWSE we aim to use concurrence measures as a means of sug-gesting additional navigation steps for users browsing the enti-tiesmdashif the suggestion is uninteresting it can be ignored whereasincorrect consolidation will often lead to conspicuously garbled re-sults aggregating data on multiple disparate entities

Thus our requirements (particularly for scale) preclude the pos-sibility of complex analyses or any form of pair-wise comparisonetc Instead we aim to design lightweight methods implementableby means of distributed sorts and scans over the corpus Our meth-ods are designed around the following intuitions and assumptions

(i) the concurrence of entities is measured as a function of theirshared pairs be they predicatendashsubject (loosely inlinks) orpredicatendashobject pairs (loosely outlinks or attribute values)

(ii) the concurrence measure should give a higher weight toexclusive shared-pairsmdashpairs that are typically shared byfew entities for edges (predicates) that typically have alow in-degreeout-degree

(iii) with the possible exception of correlated pairs each addi-tional shared pair should increase the concurrence of the enti-tiesmdasha shared pair cannot reduce the measured concurrenceof the sharing entities

(iv) strongly exclusive property-pairs should be more influentialthan a large set of weakly exclusive pairs

(v) correlation may exist between shared pairsmdasheg two entitiesmay share an inlink and an inverse-outlink to the same node(eg foafdepiction foafdepicts) or may share alarge number of shared pairs for a given property (eg twoentities co-authoring one paper are more likely to co-authorsubsequent papers)mdashwhere we wish to dampen the cumula-tive effect of correlation in the concurrence analysis

(vi) the relative value of the concurrence measure is importantthe absolute value is unimportant

In fact the concurrence analysis follows a similar principle tothat for consolidation where instead of considering discrete func-tional and inverse-functional properties as given by the semanticsof the data we attempt to identify properties that are quasi-func-tional quasi-inverse-functional or what we more generally termexclusive we determine the degree to which the values of proper-ties (here abstracting directionality) are unique to an entity or setof entities The concurrence between two entities then becomes anaggregation of the weights for the propertyndashvalue pairs they sharein common Consider the following running-example

94 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

dblpAliceB10 foafmaker exAlice

dblpAliceB10 foafmaker exBob

exAlice foafgender lsquolsquofemalersquorsquoexAlice foafworkplaceHomepage lthttpwonderlandcomgt

exBob foafgender lsquolsquomalersquorsquoexBob foafworkplaceHomepage lthttpwonderlandcomgt

exClaire foafgender lsquolsquofemalersquorsquoexClaire foafworkplaceHomepage lthttpwonderlandcomgt

where we want to determine the level of (relative) concurrence be-tween three colleagues exAlice exBob and exClaire iehow much do they coincideconcur with respect to exclusive sharedpairs

611 Quantifying concurrenceFirst we want to characterise the uniqueness of properties

thus we analyse their observed cardinality and inverse-cardinalityas found in the corpus (in contrast to their defined cardinality aspossibly given by the formal semantics)

Definition 1 (Observed cardinality) Let G be an RDF graph p be aproperty used as a predicate in G and s be a subject in G Theobserved cardinality (or henceforth in this section simply cardinal-ity) of p wrt s in G denoted CardG(ps) is the cardinality of the seto 2 Cj(spo) 2 G

Definition 2 (Observed inverse-cardinality) Let G and p be asbefore and let o be an object in G The observed inverse-cardinality(or henceforth in this section simply inverse-cardinality) of p wrt oin G denoted ICardG(po) is the cardinality of the sets 2 U [ Bj(spo) 2 G

Thus loosely the observed cardinality of a property-subjectpair is the number of unique objects it appears within the graph(or unique triples it appears in) letting Gex denote our examplegraph then eg CardGex (foafmakerdblpAliceB10) = 2 Wesee this value as a good indicator of how exclusive (or selective) agiven property-subject pair is where sets of entities appearing inthe object position of low-cardinality pairs are considered to con-cur more than those appearing with high-cardinality pairs The ob-served inverse-cardinality of a property-object pair is the numberof unique subjects it appears with in the graphmdasheg ICardGex (foafgenderfemale) = 2 Both directions are considered analogousfor deriving concurrence scoresmdashnote however that we do not con-sider concurrence for literals (ie we do not derive concurrence forliterals that share a given predicatendashsubject pair we do of courseconsider concurrence for subjects with the same literal value fora given predicate)

To avoid unnecessary duplication we henceforth focus ondescribing only the inverse-cardinality statistics of a propertywhere the analogous metrics for plain-cardinality can be derivedby switching subject and object (that is switching directional-ity)mdashwe choose the inverse direction as perhaps being more intu-itive indicating concurrence of entities in the subject positionbased on the predicatendashobject pairs they share

Definition 3 (Average inverse-cardinality) Let G be an RDF graphand p be a property used as a predicate in G The average inverse-cardinality (AIC) of p with respect to G written AICG(p) is theaverage of the non-zero inverse-cardinalities of p in the graph GFormally

AICGethpTHORN frac14j feths oTHORN j eths p oTHORN 2 Gg jj fo j 9s ethsp oTHORN 2 Gg j

The average cardinality ACG(p) of a property p is defined analo-

gously as for AICG(p) Note that the (inverse-)cardinality value ofany term appearing as a predicate in the graph is necessarily great-er-than or equal-to one the numerator is by definition greater-than or equal-to the denominator Taking an example AICGex (foafgender)=15 which can be viewed equivalently as the averagenon-zero cardinalities of foafgender (1 for male and 2 forfemale) or the number of triples with predicate foaf genderdivided by the number of unique values appearing in the object po-sition of such triples (3

2)We call a property p for which we observe AICG(p) 1 a quasi-

inverse-functional property with respect to the graph G and analo-gously properties for which we observe ACG(p) 1 as quasi-func-tional properties We see the values of such propertiesmdashin theirrespective directionsmdashas being very exceptional very rarelyshared by entities Thus we would expect a property such as foafgender to have a high AICG(p) since there are only two object-val-ues (male female) shared by a large number of entitieswhereas we would expect a property such as foafworkplace-

Homepage to have a lower AICG(p) since there are arbitrarily manyvalues to be shared amongst the entities given such an observa-tion we then surmise that a shared foafgender value representsa weaker lsquolsquoindicatorrsquorsquo of concurrence than a shared value forfoafworkplaceHomepage

Given that we deal with incomplete information under the OpenWorld Assumption underlying RDF(S)OWL we also wish to weighthe average (inverse) cardinality values for properties with a lownumber of observations towards a global meanmdashconsider a fictional property exmaritalStatus for which we onlyencounter a few predicate-usages in a given graph and consider twoentities given the value married given sparse inverse-cardinal-ity observations we may naiumlvely over-estimate the significance ofthis property-object pair as an indicator for concurrence Thus weuse a credibility formula as follows to weight properties with fewobservations towards a global mean as follows

Definition 4 (Adjusted average inverse-cardinality) Let p be aproperty appearing as a predicate in the graph G The adjustedaverage inverse-cardinality (AAIC) of p with respect to G writtenAAICG(p) is then

AAICGethpTHORN frac14AICGethpTHORN j G p j thornAICG j G j

j G p j thorn j G jeth1THORN

where jG pj is the number of distinct objects that appear in a triplewith p as a predicate (the denominator of Definition 3) AICG is theaverage inverse-cardinality for all predicatendashobject pairs (formallyAICG frac14 jGj

jfethpoTHORNj9sethspoTHORN2Ggj) and jGj is the average number of distinct

objects for all predicates in the graph (formally j G j frac14jfethpoTHORNj9sethspoTHORN2Ggjjfpj9s9oethspoTHORN2GgjTHORN

Again the adjusted average cardinality AACG(p) of a property pis defined analogously as for AAICG(p) Some reduction of Eq (1) is

possible if one considers that AICGethpTHORN jG pj frac14 jfeths oTHORNjeths p oTHORN2 Ggj also denotes the number of triples for which p appears as a

predicate in graph G and that AICG j Gj frac14 jGjjfpj9s9oethspoTHORN2Ggj also de-

notes the average number of triples per predicate We maintain Eq

swpersonstefan-decker

swpersonandreas-harth

swpaper140

swpaper2221

swpaper3403

dbpediaIreland

sworgnui-galway

sworgderi-nui-galway

foafmade

swrcaffiliaton

foafbased_near

dccreator swrcauthorfoafmaker

foafmade

dccreatorswrcauthorfoafmaker

foafmadedccreator

swrcauthorfoafmaker

foafmember

foafmember

swrcaffiliaton

Fig 8 Example of same-value inter-property and intra-property correlation where the two entities under comparison are highlighted in the dashed box and where thelabels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34])

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 95

(1) in the given unreduced form as it more clearly corresponds tothe structure of a standard credibility formula the reading(AICG(p)) is dampened towards a mean (AICG) by a factor deter-mined by the size of the sample used to derive the reading

(jG pj) relative to the average sample size (jGj)Now we move towards combining these metrics to determine

the concurrence of entities who share a given non-empty set ofpropertyndashvalue pairs To do so we combine the adjusted average(inverse) cardinality values which apply generically to propertiesand the (inverse) cardinality values which apply to a given prop-ertyndashvalue pair For example take the property foafworkplace-Homepage entities that share a value referential to a largecompanymdasheg httpgooglecommdashshould not gain as muchconcurrence as entities that share a value referential to a smallercompanymdasheg httpderiie Conversely consider a fictionalproperty excitizenOf which relates a citizen to its country andfor which we find many observations in our corpus returning ahigh AAIC value and consider that only two entities share the va-lue exVanuatu for this property given that our data are incom-plete we can use the high AAIC value of excitizenOf todetermine that the property is usually not exclusive and that itis generally not a good indicator of concurrence28

We start by assigning a coefficient to each pair (po) and eachpair (ps) that occur in the dataset where the coefficient is an indi-cator of how exclusive that pair is

Definition 5 (Concurrence coefficients) The concurrence-coefficientof a predicatendashsubject pair (ps) with respect to a graph G is given as

CGethp sTHORN frac141

CardGethp sTHORN AACGethpTHORN

and the concurrence-coefficient of a predicatendashobject pair (po) withrespect to a graph G is analogously given as

ICGethp oTHORN frac141

ICardGethp oTHORN AAICGethpTHORN

Again please note that these coefficients fall into the interval]01] since the denominator by definition is necessarily greaterthan one

To take an example let pwh = foafworkplaceHomepage andsay that we compute AAICG(pwh)=7 from a large number of obser-

28 Here we try to distinguish between propertyndashvalue pairs that are exclusive inreality (ie on the level of whatrsquos signified) and those that are exclusive in the givengraph Admittedly one could think of counter-examples where not including thegeneral statistics of the property may yield a better indication of weightedconcurrence particularly for generic properties that can be applied in many contextsfor example consider the exclusive predicatendashobject pair (skos subject cate-gory Koenigsegg_vehicles) given for a non-exclusive property

vations indicating that each workplace homepage in the graph G islinked to by on average seven employees Further letog = httpgooglecom and assume that og occurs 2000 timesas a value for pwh ICardG(pwhog) = 2000 now ICGethpwh ogTHORN frac14

120007 frac14 000007 Also let od = httpderiie such thatICardG(pwhod)=100 now ICGethpwh odTHORN frac14 1

107 000143 Heresharing DERI as a workplace will indicate a higher level of concur-rence than analogously sharing Google

Finally we require some means of aggregating the coefficientsof the set of pairs that two entities share to derive the final concur-rence measure

Definition 6 (Aggregated concurrence score) Let Z = (z1 zn) be atuple such that for each i = 1 n zi2]01] The aggregatedconcurrence value ACSn is computed iteratively starting withACS0 = 0 then for each k = 1 n ACSk = zk + ACSk1 zkfraslACSk1

The computation of the ACS value is the same process as deter-mining the probability of two independent events occurringmdashP(A _ B) = P(A) + P(B) P(AfraslB)mdashwhich is by definition commuta-tive and associative and thus computation is independent of theorder of the elements in the tuple It may be more accurate to viewthe coefficients as fuzzy values and the aggregation function as adisjunctive combination in some extensions of fuzzy logic [75]

However the underlying coefficients may not be derived fromstrictly independent phenomena there may indeed be correlationbetween the propertyndashvalue pairs that two entities share To illus-trate we reintroduce a relevant example from [34] shown in Fig 8where we see two researchers that have co-authored many paperstogether have the same affiliation and are based in the samecountry

This example illustrates three categories of concurrencecorrelation

(i) same-value correlation where two entities may be linked to thesame value by multiple predicates in either direction (egfoaf made dccreator swrcauthor foafmaker)

(ii) intra-property correlation where two entities that share agiven propertyndashvalue pair are likely to share further valuesfor the same property (eg co-authors sharing one valuefor foafmade are more likely to share further values)

(iii) inter-property correlation where two entities sharing a givenpropertyndashvalue pair are likely to share further distinct butrelated propertyndashvalue pairs (eg having the same valuefor swrcaffiliation and foafbased_near)

Ideally we would like to reflect such correlation in the compu-tation of the concurrence between the two entities

Regarding same-value correlation for a value with multipleedges shared between two entities we choose the shared predicate

96 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

edge with the lowest AA[I]C value and disregard the other edgesie we only consider the most exclusive property used by bothentities to link to the given value and prune the other edges

Regarding intra-property correlation we apply a lower-levelaggregation for each predicate in the set of shared predicate-valuepairs Instead of aggregating a single tuple of coefficients we gen-erate a bag of tuples Z frac14 fZp1 Zpn

g where each element Zpi

represents the tuple of (non-pruned) coefficients generated forthe predicate pi29 We then aggregate this bag as follows

ACSethZTHORN frac14 ACSethACSethZpi AAfrac12ICethpiTHORNTHORNZpi

2ZTHORN

where AA[I]C is either AAC or AAIC dependant on the directionalityof the predicate-value pair observed Thus the total contributionpossible through a given predicate (eg foafmade) has anupper-bound set as its AA[I]C value where each successive sharedvalue for that predicate (eg each successive co-authored paper)contributes positively (but increasingly less) to the overall concur-rence measure

Detecting and counteracting inter-property correlation is per-haps more difficult and we leave this as an open question

612 Implementing entity-concurrence analysisWe aim to implement the above methods using sorts and scans

and we wish to avoid any form of complex indexing or pair-wisecomparison First we wish to extract the statistics relating to the(inverse-)cardinalities of the predicates in the data Given that thereare 23 thousand unique predicates found in the input corpus weassume that we can fit the list of predicates and their associatedstatistics in memorymdashif such were not the case one could consideran on-disk map where we would expect a high cache hit-rate basedon the distribution of property occurrences in the data (cf [32])

Moving forward we can calculate the necessary predicate-levelstatistics by first sorting the data according to natural order(spoc) and then scanning the data computing the cardinality(number of distinct objects) for each (sp) pair and maintainingthe average cardinality for each p found For inverse-cardinalityscores we apply the same process sorting instead by (opsc) or-der counting the number of distinct subjects for each (po) pairand maintaining the average inverse-cardinality scores for eachp After each scan the statistics of the properties are adjustedaccording to the credibility formula in Eq (4)

We then apply a second scan of the sorted corpus first we scanthe data sorted in natural order and for each (sp) pair for each setof unique objects Ops found thereon and for each pair in

fethoi ojTHORN 2 U [ B U [ B j oi oj 2 Ops oi lt ojg

where lt denotes lexicographical order we output the following sex-tuple to an on-disk file

ethoi ojCethp sTHORNp sTHORN

where Cethp sTHORN frac14 1jOps jAACethpTHORN We apply the same process for the other

direction for each (po) pair for each set of unique subjects Spo andfor each pair in

fethsi sjTHORN 2 U [ B U [ B j si sj 2 Spo si lt sjg

we output analogous sextuples of the form

ethsi sj ICethp oTHORN p othornTHORN

We call the sets Ops and their analogues Spo concurrence classesdenoting sets of entities that share the given predicatendashsubjectpredicatendashobject pair respectively Here note that the lsquo + rsquo andlsquo rsquo elements simply mark and track the directionality from whichthe tuple was generated required for the final aggregation of the

29 For brevity we omit the graph subscript

co-efficient scores Similarly we do not immediately materialisethe symmetric concurrence scores where we instead do so at theend so as to forego duplication of intermediary processing

Once generated we can sort the two files of tuples by their nat-ural order and perform a merge-join on the first two elementsmdashgeneralising the directional oisi to simply ei each (eiej) pair de-notes two entities that share some predicate-value pairs in com-mon where we can scan the sorted data and aggregate the finalconcurrence measure for each (eiej) pair using the information en-coded in the respective tuples We can thus generate (triviallysorted) tuples of the form (eiejs) where s denotes the final aggre-gated concurrence score computed for the two entities optionallywe can also write the symmetric concurrence tuples (ejeis) whichcan be sorted separately as required

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets; for example, consider a corpus containing 1 million persons sharing the value female for the property foaf:gender, where we would have to generate 10⁶ × (10⁶ − 1)/2 ≈ 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., |S_po|) is a factor in the denominator of the computed C(p, o) score, such that C(p, o) ≤ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
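A small sketch of how such a cut-off can be chosen from the class-size distribution gathered during the statistics scan; the dictionary input and function name are assumptions, and the pair count per class of size s follows the s(s − 1)/2 figure used above.

```python
def max_class_size_for_budget(class_size_counts, budget=10**9):
    """Given counts of concurrence classes per size s, return the largest
    cut-off such that the cumulative number of raw non-reflexive,
    non-symmetric tuples (sum over sizes of count * s*(s-1)/2) stays
    within the budget (around 10**9 raw tuples in our evaluation)."""
    total, cutoff = 0, 1
    for size in sorted(class_size_counts):
        pairs = class_size_counts[size] * size * (size - 1) // 2
        if total + pairs > budget:
            break
        total += pairs
        cutoff = size
    return cutoff
```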

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step and produce the final non-symmetric concurrence tuples.

Table 12
Breakdown of timing of distributed concurrence analysis (minutes and percentage of total execution time).

Category                                1              2             4             8             Full-8
                                        min    %       min    %      min    %      min    %      min    %
Master
  Total execution time                  516.2  100.0   262.7  100.0  137.8  100.0  73.0   100.0  835.4  100.0
  Executing (miscellaneous)             2.2    0.4     2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
  Idle (waiting for slaves)             514.0  99.6    260.4  99.1   134.6  97.7   69.6   95.4   833.3  99.7
Slave
  Avg. executing (total excl. idle)     514.0  99.6    257.8  98.1   131.1  95.1   66.4   91.0   786.6  94.2
    Split/sort/scatter (subject)        119.6  23.2    59.3   22.6   29.9   21.7   14.9   20.4   175.9  21.1
    Merge-sort (subject)                42.1   8.2     20.7   7.9    11.2   8.1    5.7    7.9    62.4   7.5
    Split/sort/scatter (object)         115.4  22.4    56.5   21.5   28.8   20.9   14.3   19.6   164.0  19.6
    Merge-sort (object)                 37.2   7.2     18.7   7.1    9.6    7.0    5.1    7.0    55.6   6.6
    Extract high-level statistics       20.8   4.0     10.1   3.8    5.1    3.7    2.7    3.7    28.4   3.3
    Generate raw concurrence tuples     45.4   8.8     23.3   8.9    11.6   8.4    5.9    8.0    67.7   8.1
    Coordinate/sort concurrence tuples  97.6   18.9    50.1   19.1   25.0   18.1   12.5   17.2   166.0  19.9
    Merge-sort/aggregate similarity     36.0   7.0     19.2   7.3    10.0   7.2    5.3    7.2    66.6   8.0
  Avg. idle                             2.2    0.4     4.9    1.9    6.7    4.9    6.5    9.0    48.8   5.8
    Waiting for peers                   0.0    0.0     2.6    1.0    3.6    2.6    3.2    4.3    46.7   5.6
    Waiting for master                  2.2    0.4     2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.
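A sketch of that coordinate primitive is given below, under assumed names for the messaging layer (send and receive_sorted_runs are placeholders, not actual APIs): each slave splits its local tuples by a modulo-hash on the join position, sorts and scatters the splits, and merge-sorts whatever it receives from its peers.

```python
import hashlib
from heapq import merge

def coordinate(local_tuples, num_slaves, key_index, send, receive_sorted_runs):
    """Split by hash(join key) % num_slaves, sort each split, scatter the
    splits to their target peers, then merge-sort the incoming sorted runs."""
    splits = [[] for _ in range(num_slaves)]
    for t in local_tuples:
        h = int(hashlib.md5(t[key_index].encode('utf-8')).hexdigest(), 16)
        splits[h % num_slaves].append(t)
    for peer, split in enumerate(splits):
        send(peer, sorted(split))                  # sorted runs keep the final merge cheap
    return list(merge(*receive_sorted_runs()))     # local merge-sort of the received runs
```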

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to uneven distribution of data and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1970; even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (95.7%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate-object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static; i.e., larger class sizes give


Fig. 9. Breakdown of potential concurrence classes with respect to predicate-object pairs and predicate-subject pairs, respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i × (s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) chosen to keep the total number of tuples at around 1 billion (log/log). (a) Predicate-object pairs; (b) predicate-subject pairs.

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

    Predicate            Objects        Triples        AAC
1   foaf:nick            150,433,035    150,437,864    1.000
2   lldpubmed:journal    6,790,426      6,795,285      1.003
3   rdf:value            2,802,998      2,803,021      1.005
4   eurostat:geo         2,642,034      2,642,034      1.005
5   eurostat:time        2,641,532      2,641,532      1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

    Predicate                    Subjects     Triples      AAIC
1   lldpubmed:meshHeading        2,121,384    2,121,384    1.009
2   opiumfield:recommendation    1,920,992    1,920,992    1.010
3   fb:type.object.key           1,108,187    1,108,187    1.017
4   foaf:page                    1,702,704    1,712,970    1.017
5   skipinions:hasFeature        1,010,604    1,010,604    1.019


quadratically more concurrences, but occur polynomially less often, until the point where the largest classes (which generally have only one occurrence each) are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate-subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate-subject pair contained 79 thousand entities, versus 1.9 million for the largest predicate-object pair, given respectively by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, …). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000), we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996); the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.³⁰ Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.³¹ We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).

With respect to the statistics of predicates, for the predicate-subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate-object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

³⁰ cf. http://api.hi5.com/rest/profile/foaf/100614697
³¹ cf. http://bio2rdf.org/mesh:D000123Q000235

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although

Table 15
Top five concurrent entities and the number of pairs they share.

    Entity label 1    Entity label 2    Concur.
1   New York City     New York State    791
2   London            England           894
3   Tokyo             Japan             900
4   Toronto           Ontario           418
5   Philadelphia      Pennsylvania      217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

    Ranked entity       Con.    "Closest" entity    Val.
1   Tim Berners-Lee     908     Lalana Kagal        0.83
2   Dan Brickley        2552    Libby Miller        0.94
3   updatestatus.net    11      socialnetwork.ro    0.45
4   FOAF-a-matic        21      foaf.me             0.23
5   Evan Prodromou      3367    Stav Prodromou      0.89


fewer inter-PLD concurrences are found, they typically have higher concurrence scores.³²

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters; properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities; for brevity, we again show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;³³ (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

³² Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.
³³ Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.
³⁴ We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.
³⁵ We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
³⁶ http://models.okkam.org/ENS-core-vocabulary#country_of_residence
³⁷ http://ontologydesignpatterns.org/cp/owl/fsdas/

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.³⁴ In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; and (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which are not in the previous consolidation step, we output sextuples of the form

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.³⁵

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,³⁶ and one directory gives 9 owl:AllDisjointClasses axioms,³⁷ and we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 . x p l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:


Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
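A small sketch of that relaxed comparison, assuming literals are modelled as (lexical form, language tag, datatype) triples; the function name and representation are illustrative, and equality of data values is simplified to tuple equality here.

```python
def flags_inconsistency(lit1, lit2):
    """Relaxed heuristic for two values of a functional datatype property:
    flag an inconsistency only if the literals differ as data values *and*
    their lower-cased lexical forms (ignoring language tag and datatype)
    also differ."""
    if lit1 == lit2:                 # equal values: nothing to flag
        return False
    return lit1[0].lower() != lit2[0].lower()

# ("Male", "en", None) vs ("male", None, None)   -> False (not flagged)
# ("male", None, None) vs ("female", None, None) -> True  (flagged)
```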

With respect to these consistency checking rules, we consider the terminological data to be sound.³⁸ We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at

³⁸ In any case, we always source terminological data from the raw, unconsolidated corpus.

least once in a consolidatable position (thus we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"³⁹ and "Sergio Fernández Anzuola",⁴⁰ who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.⁴¹ The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

³⁹ http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
⁴⁰ http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.⁴²

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

⁴¹ http://data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
⁴² Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
made owl:propertyDisjointWith maker .

Assertional [fictional]:
ex:AZ″ maker ex:entcons′ .
dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); and (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

⁴³ We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, …, Cn}, C ⊆ 2^(U ∪ B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq⁰, each containing an individual identifier in the equivalence class. Now let ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of the entities in Ei and Ej; intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqⁿ such that there is no {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs, such that neither Ei nor Ej have already been merged in this iteration, producing Eqⁿ⁺¹ at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence classes by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.⁴³
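A minimal, single-machine sketch of this greedy consistent clustering is given below, assuming the incompatible pairs and the lexicographically ordered (d, c) weighting have already been derived; the function and parameter names are illustrative, not the system's implementation.

```python
from itertools import combinations

def repair_equivalence_class(identifiers, incompatible, weight):
    """Greedy consistent clustering: start from singleton partitions, then
    repeatedly merge the most strongly linked pairs of partitions whose
    union contains no incompatible pair.  'incompatible' is a set of
    frozenset pairs {a, b}; 'weight' returns the (evidence count,
    concurrence) pair for two partitions, compared lexicographically."""
    partitions = [frozenset([i]) for i in identifiers]
    while True:
        candidates = []
        for e_i, e_j in combinations(partitions, 2):
            union = e_i | e_j
            if any(pair <= union for pair in incompatible):
                continue                       # merge would re-introduce an inconsistency
            candidates.append((weight(e_i, e_j), e_i, e_j))
        if not candidates:
            return partitions                  # fixpoint: no consistent merge remains
        candidates.sort(key=lambda t: t[0], reverse=True)
        merged, next_partitions = set(), []
        for _, e_i, e_j in candidates:
            if e_i in merged or e_j in merged:
                continue                       # each partition merges at most once per round
            merged.update([e_i, e_j])
            next_partitions.append(e_i | e_j)
        next_partitions += [p for p in partitions if p not in merged]
        partitions = next_partitions
```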

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s-o over the sorted stream of data and batch consolidated segments of data.
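The batching step can be pictured as follows (a sketch under assumed names, with 'in'/'out' tags added purely for exposition): both sorted streams are keyed on a canonical identifier in their first position, so a single merge-join groups, per canonical term, the consolidated outlinks and inlinks that the micro-level repair needs.

```python
from heapq import merge
from itertools import groupby

def consolidated_batches(subject_sorted, object_sorted):
    """Merge-join the (s, p, o, c, s', o')-sorted and (o, p, s, c, o', s')-
    sorted sextuple streams on their first (canonical) element, yielding
    one batch of outlinks and inlinks per consolidated entity."""
    tagged = merge(((t[0], 'out', t) for t in subject_sorted),
                   ((t[0], 'in', t) for t in object_sorted),
                   key=lambda x: x[0])
    for canonical, group in groupby(tagged, key=lambda x: x[0]):
        batch = list(group)
        outlinks = [t for _, tag, t in batch if tag == 'out']
        inlinks = [t for _, tag, t in batch if tag == 'in']
        yield canonical, outlinks, inlinks
```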

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently these segments fit in


memory, where for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and to the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair (minutes and percentage of total execution time).

Category                                   1              2             4             8             Full-8
                                           min    %       min    %      min    %      min    %      min    %
Master
  Total execution time                     114.7  100.0   61.3   100.0  36.7   100.0  23.8   100.0  141.1  100.0
  Executing (miscellaneous)                –      –       0.1    0.1    0.1    0.1    –      –      –      –
  Idle (waiting for slaves)                114.7  100.0   61.2   99.8   36.6   99.9   23.8   99.9   141.1  100.0
Slave
  Avg. executing (total excl. idle)        114.7  100.0   59.0   96.3   33.6   91.5   22.3   93.7   130.9  92.8
    Identify inconsistencies and repairs   74.7   65.1    38.5   62.8   23.2   63.4   17.2   72.2   72.4   51.3
    Repair corpus                          40.0   34.8    20.5   33.5   10.3   28.1   5.1    21.5   58.5   41.5
  Avg. idle                                0.0    0.0     2.3    3.7    3.1    8.5    1.5    6.3    10.2   7.2
    Waiting for peers                      0.0    0.0     2.2    3.7    3.1    8.4    1.5    6.2    10.1   7.2
    Waiting for master                     –      –       0.1    0.1    –      –      –      –      0.1    0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards to the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of about one minute is spent on the master machine. Most notably, load balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards to the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual       Auto         Count    %
SAME         *            223      44.3
DIFFERENT    *            261      51.9
UNCLEAR      –            19       3.8
*            SAME         361      74.6
*            DIFFERENT    123      25.4
SAME         SAME         205      91.9
SAME         DIFFERENT    18       8.1
DIFFERENT    DIFFERENT    105      40.2
DIFFERENT    SAME         156      59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

    Functional property                 Detections
1   foaf:gender                         60
2   foaf:age                            32
3   dbo:height, dbo:length              4
4   dbo:wheelbase, dbo:width            3
5   atomowl:body                        1
6   loc:address, loc:name, loc:phone    1

Table 19
Breakdown of inconsistency detections for disjoint classes.

    Disjoint class 1     Disjoint class 2    Detections
1   foaf:Document        foaf:Person         163
2   foaf:Document        foaf:Agent          153
3   foaf:Organization    foaf:Person         17
4   foaf:Person          foaf:Project        3
5   foaf:Organization    foaf:Project        1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.

⁴⁴ We were particularly careful to distinguish information resources (such as Wikipedia articles) and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint-class pairs in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected the option UNCLEAR 19 times, the option SAME 232 times (of which 49 were TRIVIALLY SAME), and the option DIFFERENT 249 times.⁴⁴ We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
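These values can be read directly off the counts in Table 20:

precision(SAME) = 205/361 ≈ 0.568, recall(SAME) = 205/223 ≈ 0.919, F1 = 2 × 0.568 × 0.919 / (0.568 + 0.919) ≈ 0.702;
precision(DIFFERENT) = 105/123 ≈ 0.854, recall(DIFFERENT) = 105/261 ≈ 0.402, F1 ≈ 0.546.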

Interpreting and inspecting the results, we found that the


inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30% of the pairs which were annotated as different but were re-consolidated by the repair were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.⁴⁵ Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

⁴⁵ Of course, in OWL (2) DL, datatype properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;⁴⁶

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen, for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what; the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real

⁴⁶ We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.⁴⁷ Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems, where the data model does not provide primitives for making same-as assertions [...], there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty; [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching; for example, in 2009 the Ontology Alignment Evaluation Initiative⁴⁸ introduced a new test track on instance matching.⁴⁹ We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

⁴⁷ Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.
⁴⁸ OAEI: http://oaei.ontologymatching.org/
⁴⁹ Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performinginstance matching (i) L2R applies deductive techniques whererules are used to model the semantics of the respective ontol-ogymdashparticularly functional properties inverse-functional proper-ties and disjointness constraintsmdashwhich are then used to inferconsistent coreference matches (ii) N2R is an inductive numericapproach which uses the results of L2R to perform unsupervisedmatching where a system of linear-equations representing simi-larities is seeded using text-based analysis and then used to com-pute instance similarity by iterative resolution Scalability is notdirectly addressed

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to the one presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies that can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string-similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability: the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string-matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider the use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to ours, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, conceding that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
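To give a concrete flavour of this style of optimisation (a generic sketch of the idea rather than BigOWLIM's or our own implementation; the identifiers and the lexicographically-lowest choice of canonical term are illustrative assumptions), the equivalence classes induced by owl:sameAs pairs can be collapsed onto a single representative with a union-find structure before rewriting the data, rather than materialising every pairwise owl:sameAs relation:

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Path-halving find; unseen terms start as their own root.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            keep, drop = sorted((ra, rb))  # e.g. lexicographically lowest term wins
            self.parent[drop] = keep

def canonicalise(triples, same_as_pairs):
    # Rewrite subject/object terms to the canonical member of their
    # owl:sameAs equivalence class (predicates are left untouched).
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    return [(uf.find(s), p, uf.find(o)) for (s, p, o) in triples]

pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c")]
data = [("ex:c", "foaf:name", '"Alice"'), ("ex:x", "foaf:knows", "ex:a")]
print(canonicalise(data, pairs))
# ex:c is rewritten to the canonical term ex:a; triples already using ex:a are unchanged.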

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods within closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
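As a rough, self-contained illustration of the first of these ideas, grouping candidates that share a value for a discriminating property, the following sketch (our own simplification rather than Song and Heflin's method; the size threshold and identifiers are assumptions) buckets instances by (property, value) pairs whose values are shared by only a few instances:

from collections import defaultdict

def candidate_groups(statements, max_group_size=2):
    # Bucket instances by (property, value); a bucket shared by only a few
    # instances indicates a discriminating property value, and its members
    # become a candidate group for more expensive matching.
    by_pv = defaultdict(set)
    for s, p, v in statements:
        by_pv[(p, v)].add(s)
    return [subs for subs in by_pv.values() if 1 < len(subs) <= max_group_size]

stmts = [
    ("ex:b1", "ex:isbn", "0-321-14653-0"),
    ("ex:b2", "ex:isbn", "0-321-14653-0"),
    ("ex:b3", "ex:label", "Book"),
    ("ex:b4", "ex:label", "Book"),
    ("ex:b5", "ex:label", "Book"),
]
print(candidate_groups(stmts))
# Only the shared ISBN value forms a group; the "Book" label is borne by
# too many instances to be considered discriminating here.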

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space, where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
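For readers unfamiliar with the technique, the following generic MinHash-style sketch (our own illustration, not RDFSim's actual index; the band/row counts, the use of MD5 as a seeded hash, and the token sets are assumptions) shows how the token sets of such virtual documents can be bucketed so that only likely-similar pairs need be compared with the Jaccard coefficient:

import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(tokens, num_hashes):
    # One min-hash value per seeded hash function over the token set.
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def lsh_candidate_pairs(docs, bands=8, rows=2):
    # Documents whose signatures agree on at least one band share a bucket;
    # only documents sharing a bucket become candidate pairs.
    sigs = {d: minhash_signature(toks, bands * rows) for d, toks in docs.items()}
    buckets = defaultdict(set)
    for d, sig in sigs.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(d)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

docs = {
    "ex:Paris": {"paris", "france", "capital", "seine"},
    "ex:Paris_France": {"paris", "capital", "france", "eiffel"},
    "ex:Paris_Hilton": {"paris", "hilton", "celebrity", "person"},
}
for a, b in lsh_candidate_pairs(docs):
    # Candidate pairs are then checked with an exact similarity measure.
    print(a, b, round(jaccard(docs[a], docs[b]), 2))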

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate that most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage the use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems such as RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations that we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present a discussion of the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found contained 5 thousand identifiers (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided a comprehensive discussion of scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we noted that many of the entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec/
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second international workshop on Information quality in information systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on The Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using mapreduce, in: International Semantic Web Conference (2009) 634–649.
[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

Page 3: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

78 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

gives a high percentage of correct results and leads to a clean con-solidated corpus over an approach that gives a higher percentage ofconsolidated results but leads to a partially garbled corpus simi-larly we prefer a system that can handle more data (is more scal-able) but may possibly have a lower throughput (is less efficient)4

Thus herein we revisit methods for scalable precise automaticand domain-agnostic entity consolidation over large static LinkedData corpora In order to make our methods scalable we avoid dy-namic on-disk index structures and instead opt for algorithms thatrely on sequential on-disk readswrites of compressed flat filesusing operations such as scans external sorts merge-joins andonly light-weight or non-critical in-memory indices In order tomake our methods efficient we demonstrate distributed imple-mentations of our methods over a cluster of shared-nothing com-modity hardware where our algorithms attempt to maximise theportion of time spent in embarrassingly parallel executionmdashieparallel independent computation without need for inter-machinecoordination In order to make our methods domain-agnostic anditfully-automatic we exploit the generic formal semantics of thedata described in RDF(S)OWL and also generic statistics derivablefrom the corpus In order to achieve high recall we incorporateadditional OWL features to find novel coreference through reason-ing Aiming at high precision we introduce methods that again ex-ploit the semantics and also the statistics of the data but toconversely disambiguate entities to defeat equivalences found inthe previous step that are unlikely to be true according to somecriteria

As such extending upon various reasoning and consolidationtechniques described in previous works [30313334] we now givea self-contained treatment of our results in this area In additionwe distribute the execution of all methods over a cluster of com-modity hardware we provide novel detailed performance andquality evaluation over a large real-world Linked Data corpus ofone billion statements and we also examine exploratory tech-niques for interlinking similar entities based on statistical analysesas well as methods for disambiguating and repairing incorrectcoreferences

In summary in this paper we

ndash provide some necessary preliminaries and describe our distrib-uted architecture (Section 2)

ndash characterise the 1 billion quadruple Linked Data corpus thatwill be used for later evaluation of our methods particularfocusing on the (re-)use of data-level identifiers in the corpus(Section 3)

ndash describe and evaluate our distributed base-line approach forconsolidation which leverages explicit owlsameAs relations(Section 4)

ndash describe and evaluate a distributed approach that extends con-solidation to consider a richer OWL semantics for consolidating(Section 5)

ndash present a distributed algorithm for determining weighted con-currence between entities using statistical analysis of predi-cates in the corpus (Section 6)

ndash present a distributed approach to disambiguate entitiesmdashiedetect likely erroneous consolidationmdashcombining the semanticsand statistics derivable from the corpus (Section 7)

ndash provide critical discussion (Section 8) render related work(Section 9) and conclude with discussion (Section 10)

4 Of course entity consolidation has many practical applications outside of ourreferential use-case SWSE and is useful in many generic query-answering secnariosmdashwe see these requirements as being somewhat fundamental to a consolidationcomponent to varying degrees

2 Preliminaries

In this section we provide some necessary preliminaries relat-ing to (i) RDF the structured format used in our corpora(Section 21) (ii) RDFSOWL which provides formal semantics toRDF data including the semantics of equality (Section 22) (iii)rules which are used to interpret the semantics of RDF and applyreasoning (Sections 23 24) (iv) authoritative reasoning whichcritically examines the source of certain types of Linked Data tohelp ensure robustness of reasoning (Section 25) (v) and OWL 2RLRDF rules a standard rule-based profile for supporting OWLsemantics a subset of which we use to deduce equality (Sec-tion 26) Throughout we attempt to preserve notation and termi-nology as prevalent in the literature

21 RDF

We briefly give some necessary notation relating to RDF con-stants and RDF triples see [28]

RDF constant Given the set of URI references U the set of blanknodes B and the set of literals L the set of RDF constants is denotedby C = U [ B [ L As opposed to the RDF-mandated existentialsemantics for blank-nodes we interpret blank-nodes as groundSkolem constants note also that we rewrite blank-node labels toensure uniqueness per document as per RDF merging [28]

RDF triple A triple t = (spo) 2 G = (U [ B) U C is called anRDF triple where s is called subject p predicate and o object Wecall a finite set of triples G G a graph This notion of a triple re-stricts the positions in which certain terms may appear We usethe phrase generalised triple where such restrictions are relaxedand where literals and blank-nodes are allowed to appear in thesubjectpredicate positions

RDF triple in contextquadruple Given c 2 U let http(c) denotethe possibly empty graph Gc given by retrieving the URI c throughHTTP (directly returning 200 Okay) An ordered pair (tc) with anRDF triple t = (spo) c 2U and t 2http(c) is called a triple in contextc We may also refer to (spoc) as an RDF quadruple or quad q withcontext c

22 RDFS OWL and owlsameAs

Conceptually RDF data is composed of assertional data (ie in-stance data) and terminological data (ie schema data) Assertionaldata define relationships between individuals provide literal-valued attributes of those individuals and declare individuals asmembers of classes Thereafter terminological data describe thoseclasses relationships (object properties) and attributes (datatypeproperties) and declaratively assigns them a semantics5 Withwell-defined descriptions of classes and properties reasoning canthen allow for deriving new knowledge including over assertionaldata

On the Semantic Web the RDFS [11] and OWL [42] standardsare prominently used for making statements with well-definedmeaning and enabling such reasoning Although primarilyconcerned with describing terminological knowledgemdashsuch assubsumption or equivalence between classes and propertiesetcmdashRDFS and OWL also contain features that operate on an asser-tional level The most interesting such feature for our purposes isowlsameAs a core OWL property that allows for defining equiva-lences between individuals (as well as classes and properties) Twoindividuals related by means of owlsameAs are interpreted asreferring to the same entity ie they are coreferent

5 Terminological and assertional data are not by any means necessarily disjointrthermore this distinction is not always required but is useful for our purposes

erein

Fuh

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 79

Interestingly OWL also contains other features (on a termino-logical level) that allow for inferring new owlsameAs relationsSuch features include

ndash inverse-functional properties which act as lsquolsquokey propertiesrsquorsquo foruniquely identifying subjects eg an isbn datatype-propertywhose values uniquely identifies books and other documents orthe object-property biologicalMotherOf which uniquelyidentifies the biological mother of a particular child

ndash functional properties which work in the inverse direction anduniquely identify objects eg the property hasBiological

Motherndash cardinality and max-cardinality restrictions which when given

a value of 1 uniquely identify subjects of a given class eg thevalue for the property spouse uniquely identifies a member ofthe class Monogamist

Thus we see that RDFS and OWL contain various features thatallow for inferring and supporting coreference between individu-als In order to support the semantics of such features we applyinference rules introduced next

23 Inference rules

Herein we formalise the notion of an inference rule as applicableover RDF graphs [33] We begin by defining the preliminary con-cept of a triple pattern of which rules are composed and whichmay contain variables in any position

Triple pattern basic graph pattern A triple pattern is a generalisedtriple where variables from the set V are allowed ie tv = (svpvov)2GV GV = (C [ V) (C [ V) (C [ V) We call a set (to be read asconjunction) of triple patterns GV GV a basic graph pattern We de-note the set of variables in graph pattern GV by VethGVTHORN

Intuitively variables appearing in triple patterns can be boundby any RDF term in C Such a mapping from variables to constantsis called a variable binding

Variable bindings Let X be the set of variable bindingV [ C V [ C that map every constant c 2 C to itself and every var-iable v 2 V to an element of the set C [ V A triple t is a binding of atriple pattern tv = (svpvov) iff there exists l 2X such thatt = l(tv) = (l(sv) l(pv) l(ov)) A graph G is a binding of a graph pat-tern GV iff there exists a mapping l 2X such that

Stv2GVlethtv THORN frac14 G

we use the shorthand lethGVTHORN frac14 G We use XethGVGTHORN frac14 fl j lethGVTHORNGlethvTHORN frac14 v if v R VethGVTHORNg to denote the set of variable bindingmappings for graph pattern GV in graph G that map variables out-side GV to themselves

We can now give the notion of an inference rule which is com-prised of a pair of basic graph patterns that form an lsquolsquoIF) THENrsquorsquological structure

Inference rule We define an inference rule (or often just rule) as apair r frac14 ethAnter ConrTHORN where the antecedent (or body) Anter GV

and the consequent (or head) Conr GV are basic graph patternssuch that all variables in the consequent appear in the antecedentVethConrTHORN VethAnterTHORN We write inference rules as Anter ) Conr

Finally rules allow for applying inference through rule applica-tions where the rule body is used to generate variable bindingsagainst a given RDF graph and where those bindings are appliedon the head to generate new triples that graph entails

Rule application and standard closure A rule application is theimmediate consequences TrethGTHORN frac14

Sl2XethAnter GTHORNethlethConrTHORN n lethAnterTHORNTHORN

of a rule r on a graph G accordingly for a ruleset RTRethGTHORN frac14

Sr2RTrethGTHORN Now let Githorn1 frac14 Gi [ TRethGiTHORN and G0 frac14 G the

exhaustive application of the TR operator on a graph G is then theleast fixpoint (the smallest value for n) such that Gn frac14 TRethGnTHORNWe call Gn the closure of G wrt ruleset R denoted as ClRethGTHORN or suc-cinctly G where the ruleset is obvious

The above closure takes a graph and a ruleset and recursivelyapplies the rules over the union of the original graph and the infer-ences until a fixpoint

24 T-split rules

In [3133] we formalised an optimisation based on a separationof terminological knowledge from assertional data when applyingrules Similar optimisations have also been used by other authorsto enable large-scale rule-based inferencing [747143] This opti-misation is based on the premise that in Linked Data (and variousother scenarios) terminological data speaking about classes andproperties is much smaller than assertional data speaking aboutindividuals Previous observations indicate that for a Linked Datacorpus in the order of a billion triples such terminological datacomprises of about 01 of the total data volume Additionally ter-minological data is very frequently accessed during reasoningHence seperating and storing the terminological data in memoryallows for more efficient execution of rules albeit at the cost ofpossible incompleteness [7433]

We now reintroduce some of the concepts relating to separatingterminological information [33] which we will use later whenapplying rule-based reasoning to support the semantics of OWLequality We begin by formalising some related concepts that helpdefine our notion of terminological information

Meta-class We consider a meta-class as a class whose membersare themselves (always) classes or properties Herein we restrictour notion of meta-classes to the set defined in RDF(S) andOWL specifications where examples include rdfPropertyrdfsClass owlDatatypeProperty owlFunctionalProp-

erty etc Note that eg owlThing and rdfsResource are notmeta-classes since not all of their members are classes or properties

Meta-property A meta-property is one that has a meta-class asits domain again we restrict our notion of meta-properties tothe set defined in RDF(S) and OWL specifications where examplesinclude rdfsdomain rdfsrange rdfssubClassOfowlhasKey owlinverseOf owloneOf owlonPropertyowlunionOf etc Note that rdftype owlsameAs rdfslabeleg do not have a meta-class as domain and so are not consideredas meta-properties

Terminological triple We define the set of terminological triplesT G as the union of

(i) triples with rdftype as predicate and a meta-class asobject

(ii) triples with a meta-property as predicate(iii) triples forming a valid RDF list whose head is the object of a

meta-property (eg a list used for owlunionOf etc)

The above concepts cover what we mean by terminological datain the context of RDFS and OWL Next we define the notion of tri-ple patterns that distinguish between these different categories ofdata

Terminologicalassertional pattern We refer to a terminological-triple-graph pattern as one whose instance can only be a termino-logical triple or resp a set thereof An assertional pattern is anypattern that is not terminological

Given the above notions of terminological datapatterns we cannow define the notion of a T-split inference rule that distinguishesterminological from assertional information

T-split inference rule Given a rule r frac14 ethAnter Conr) we define aT-split rule rs as the triple (AnteT

rs AnteGrs Con) where AnteT

rs isthe set of terminological patterns in Anter and AnteG

rs frac14Anter nAnteT

rs The body of T-split rules are divided into a set of patterns that

apply over terminological knowledge and a set of patterns that

80 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

apply over assertional (ie any) knowledge Such rules enable anoptimisation whereby terminological patterns are pre-bound togenerate a new larger set of purely assertional rules called T-ground rules We illustrate this with an example

Example Let Rprp-ifp denote the following rule

(p a owl InverseFunctionalProperty)(x1 p y)(x2 p y))(x1 owl sameAs x2)

When writing T-split rules we denote terminological patternsin the body by underlining Also we use lsquoarsquo as a convenient short-cut for rdftype which indicates class-membership Now take theterminological triples

(isbn a owlInverseFunctionalProperty)(mbox a owlInverseFunctionalProperty)

From these two triples we can generate two T-ground rules asfollows

(x1 isbn y)(x2 isbn y))(x1 owl sameAs x2)

(x1 mbox y)(x2 mbox y))(x1 owl sameAs x2)

These T-ground rules encode the terminological OWL knowledgegiven by the previous two triples and can be applied directly overthe assertional data eg describing specific books with ISBNnumbers

25 Authoritative T-split rules

Caution must be exercised when applying reasoning over arbi-trary data collected from the Web eg Linked Data In previousworks we have encountered various problems when naiumlvely per-forming rule-based reasoning over arbitrary Linked Data [3133]In particular third-party claims made about popular classes andproperties must be critically analysed to ensure their trustworthi-ness As a single example of the type of claims that can cause issueswith reasoning we found one document that defines nine propertiesas the domain of rdftype6 thereafter the semantics ofrdfsdomainmandate that everything which is a member of any class(ie almost every known resource) can be inferred to be a lsquolsquomemberrsquorsquo ofeach of the nine properties We found that naiumlve reasoning lead to200 more inferences than would be expected when consideringthe definitions of classes and properties as defined in their lsquolsquoname-space documentsrsquorsquo Thus in the general case performing rule-basedmaterialisation with respect to OWL semantics over arbitrary LinkedData requires some critical analysis of the source of data as per ourauthoritative reasoning algorithm Please see [9] for a detailed analysisof the explosion in inferences that occurs when authoritative analysisis not applied

Such observations prompted us to investigate more robustforms of reasoning Along these lines when reasoning over LinkedData we apply an algorithm called authoritative reasoning[143133] which critically examines the source of terminologicaltriples and conservatively rejects those that cannot be definitivelytrusted ie those that redefine classes andor properties outside ofthe namespace document

6 viz httpwwweiaonetrdf10

Dereferencing and authority We first give the functionhttpU 2G that maps a URI to an RDF graph it returns upon aHTTP lookup that returns the response code 200 Okay in the caseof failure the function maps to an empty RDF graph Next we de-fine the function redirU U that follows a (finite) sequence ofHTTP redirects given upon lookup of a URI or that returns theURI itself in the absence of a redirect or in the case of failure thisfunction first strips the fragment identifier (given after the lsquorsquo char-acter) from the original URI if present Then we give the dereferenc-ing function derefhttpredir as the mapping from a URI (a Weblocation) to an RDF graph it may provide by means of a given HTTPlookup that follows redirects Finally we can define the authorityfunctionmdashthat maps the URI of a Web document to the set of RDFterms it is authoritative formdashas follows

auth U 2C

u fc 2 U j redirethcTHORN frac14 ug [ BethhttpethsTHORNTHORNwhere B(G) denotes the set of blank-nodes appearing in a graph GIn other words a Web document is authoritative for URIs that dere-ference to it and the blank nodes it contains

We note that making RDF vocabulary URIs dereferenceable isencouraged in various best-practices [454] To enforce authorita-tive reasoning when applying rules with terminological and asser-tional patterns in the body we require that terminological triplesare served by a document authoritative for terms that are boundby specific variable positions in the rule

Authoritative T-split rule application Given a T-split rule and arule application involving a mapping l as before for the rule appli-cation to be authoritative there must additionally exist a l(v) suchthat v 2 VethAnteT THORN VethAnteGTHORN l(v) 2 auth(u) lethAnteT THORN httpethuTHORNWe call the set of variables that appear in both the terminologicaland assertional segment of the rule body (ie VethAnteT THORN VethAnteGTHORN) the set of authoritative variables for the rule Where a ruledoes not have terminological or assertional patterns we considerany rule-application as authoritative The notation of an authorita-tive T-ground rule follows likewise

Example Let Rprp-ifp denote the same rule as used in the previousexample and let the following two triples be given by the derefer-enced document of v1isbn but not v2mboxmdashie a documentauthoritative for the former term but not the latter

(v1isbn a owlInverseFunctionalProperty)(v2mbox a owlInverseFunctionalProperty)

The only authoritative variable appearing in both the termino-logical and assertional patterns of the body of Rprp-ifp is p boundby v1isbn and v2mbox above Since the document serving thetwo triples is authoritative for v1isbn the first triple is consid-ered authoritative and the following T-ground rule will begenerated

(x1 v1isbn y)(x2 v1isbn y))(x1 owlsameAs x2)

However since the document is not authoritative for v2mboxthe second T-ground rule will be considered non-authoritative anddiscarded

(x1 v2mbox y)(x2 v2mbox y))(x1 owlsameAs x2)

where third-party documents re-define remote terms in a way thataffects inferencing over instance data we filter such non-authoritative definitions


2.6. OWL 2 RL/RDF rules

Inference rules can be used to (partially) support the semantics of RDFS and OWL. Along these lines, the OWL 2 RL/RDF [24] ruleset is a partial axiomatisation of the OWL 2 RDF-Based Semantics, which is applicable for arbitrary RDF graphs and constitutes an extension of the RDF Semantics [28]; in other words, the ruleset supports a standardised profile of reasoning that partially covers the complete OWL semantics. Interestingly for our scenario, this profile includes inference rules that support the semantics and inference of owl:sameAs, as discussed.

First, in Table 1, we provide the set of OWL 2 RL/RDF rules that support the (positive) semantics of owl:sameAs, axiomatising the symmetry (rule eq-sym) and transitivity (rule eq-trans) of the relation. To take an example, rule eq-sym is as follows:

(?x, owl:sameAs, ?y) ⇒ (?y, owl:sameAs, ?x)

where, if we know that (a, owl:sameAs, b), the rule will infer the symmetric relation (b, owl:sameAs, a). Rule eq-trans operates analogously; both together allow for computing the transitive, symmetric closure of the equivalence relation. Furthermore, Table 1 also contains the OWL 2 RL/RDF rules that support the semantics of replacement (rules eq-rep-*), whereby data that hold for one entity must also hold for equivalent entities.

Note that we (optionally, and in the case of later evaluation) choose not to support:

(i) the reflexive semantics of owl:sameAs, since reflexive owl:sameAs statements will not lead to any consolidation or other non-reflexive equality relations, and will produce a large bulk of materialisations;

(ii) equality on predicates or on values for rdf:type, where we do not want possibly imprecise owl:sameAs data to affect terms in these positions.

Given that the semantics of equality is quadratic with respect to the A-Box, we apply a partial-materialisation approach that gives our notion of consolidation: instead of materialising (i.e., explicitly writing down) all possible inferences given by the semantics of replacement, we instead choose one canonical identifier to represent the set of equivalent terms, thus effectively compressing the data. We have used this approach in previous works [30–32], and it has also appeared in related works in the literature [70,6] as a common-sense optimisation for handling data-level equality. To take an example, in later evaluation (cf. Table 10) we find a valid equivalence class (set of equivalent entities) with 33,052 members: materialising all pairwise non-reflexive owl:sameAs statements would infer more than 1 billion owl:sameAs relations (33,052² − 33,052 = 1,092,401,652); further assuming that each entity appeared in, on average, e.g., two quadruples, we would infer an additional ~2 billion statements of massively duplicated data. By choosing a single canonical identifier, we would instead only materialise ~100 thousand statements.
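To make the saving explicit, a quick back-of-the-envelope check (plain Python, simply re-deriving the figures quoted above):

```python
n = 33052                 # members of the largest valid equivalence class
pairwise = n * n - n      # non-reflexive owl:sameAs pairs under full materialisation
canonical = n - 1         # mappings to a single canonical identifier instead
print(pairwise)           # 1092401652 (over a billion statements)
print(canonical)          # 33051 canonical mappings; with each entity's ~2 quadruples
                          # rewritten, this stays in the ~100 thousand range quoted above
```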

Note that although we only perform partial materialisation, we do not change the semantics of equality: our methods are sound

Table 1
Rules that support the positive semantics of owl:sameAs.

OWL2RL     Antecedent (Assertional)                         Consequent
eq-sym     (?x, owl:sameAs, ?y)                             (?y, owl:sameAs, ?x)
eq-trans   (?x, owl:sameAs, ?y), (?y, owl:sameAs, ?z)       (?x, owl:sameAs, ?z)
eq-rep-s   (?s, owl:sameAs, ?s'), (?s, ?p, ?o)              (?s', ?p, ?o)
eq-rep-o   (?o, owl:sameAs, ?o'), (?s, ?p, ?o)              (?s, ?p, ?o')

with respect to OWL semantics. In addition, alongside the partially materialised data, we provide a set of consolidated owl:sameAs relations (containing all of the identifiers in each equivalence class) that can be used to "backward-chain" the full inferences possible through replacement (as required). Thus, we do not consider the canonical identifier as somehow 'definitive' or superseding the other identifiers, but merely consider it as representing the equivalence class.7

Finally, herein we do not consider consolidation of literals: one may consider useful applications, e.g., for canonicalising datatype literals, but such discussion is out of the current scope.

As previously mentioned, OWL 2 RL/RDF also contains rules that use terminological knowledge (alongside assertional knowledge) to directly infer owl:sameAs relations. We enumerate these rules in Table 2; note that we italicise the labels of rules supporting features new to OWL 2, and that we list authoritative variables in bold.8

Applying only these OWL 2 RL/RDF rules may miss the inference of some owl:sameAs statements. For example, consider the example RDF data given in Turtle syntax [3] as follows:

# From the FOAF Vocabulary:
foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf .
foaf:isPrimaryTopicOf owl:inverseOf foaf:primaryTopic .
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

# From Example Document A:
exA:axel foaf:homepage <http://polleres.net/> .

# From Example Document B:
<http://polleres.net/> foaf:primaryTopic exB:apolleres .

# Inferred through prp-spo1:
exA:axel foaf:isPrimaryTopicOf <http://polleres.net/> .

# Inferred through prp-inv:
exB:apolleres foaf:isPrimaryTopicOf <http://polleres.net/> .

# Subsequently inferred through prp-ifp:
exA:axel owl:sameAs exB:apolleres .
exB:apolleres owl:sameAs exA:axel .

The example uses properties from the prominent FOAF vocabulary, which is used for publishing personal profiles as RDF. Here we additionally need the OWL 2 RL/RDF rules prp-inv and prp-spo1, handling standard owl:inverseOf and rdfs:subPropertyOf inferencing respectively, to infer the owl:sameAs relation entailed by the data.
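The chain of inferences above can be replayed mechanically; the following minimal sketch (illustrative Python over triples as tuples, not the authors' Java-based system) applies prp-spo1, prp-inv and then prp-ifp to the example data:

```python
# Terminology from the FOAF vocabulary (as in the example above)
SUBPROP = {"foaf:homepage": "foaf:isPrimaryTopicOf"}       # prp-spo1
INVERSE = {"foaf:primaryTopic": "foaf:isPrimaryTopicOf"}   # prp-inv (applied via p2)
IFP     = {"foaf:isPrimaryTopicOf"}                        # prp-ifp

data = {("exA:axel", "foaf:homepage", "http://polleres.net/"),
        ("http://polleres.net/", "foaf:primaryTopic", "exB:apolleres")}

# prp-spo1: (?x, p1, ?y), p1 subPropertyOf p2  =>  (?x, p2, ?y)
data |= {(s, SUBPROP[p], o) for (s, p, o) in data if p in SUBPROP}
# prp-inv: (?x, p2, ?y), p1 inverseOf p2  =>  (?y, p1, ?x)
data |= {(o, INVERSE[p], s) for (s, p, o) in data if p in INVERSE}

# prp-ifp: shared value on an inverse-functional property => owl:sameAs
same = {(s1, "owl:sameAs", s2)
        for (s1, p1, o1) in data for (s2, p2, o2) in data
        if p1 == p2 and p1 in IFP and o1 == o2 and s1 != s2}
print(same)
# {('exA:axel','owl:sameAs','exB:apolleres'), ('exB:apolleres','owl:sameAs','exA:axel')}
```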

Along these lines, we support an extended set of OWL 2 RL/RDF rules that contain precisely one assertional pattern, and for which we have demonstrated a scalable implementation, called SAOR, designed for Linked Data reasoning [33]. These rules are listed in Table 3 and, as per the previous example, can generate additional inferences that indirectly lead to the derivation of new owl:sameAs data. We will use these rules for entity consolidation in Section 5.

Lastly, as we discuss later (particularly in Section 7), Linked Data is inherently noisy and prone to mistakes. Although our methods

7 We may optionally consider non-canonical blank-node identifiers as redundant and discard them.
8 As discussed later, our current implementation requires at least one variable to appear in all assertional patterns in the body, and so we do not support prp-key and cls-maxqc3; however, these two features are not commonly used in Linked Data vocabularies [29].


Table 2
OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs. (T = terminological antecedent; A = assertional antecedent; C = consequent.)

prp-fp      T: (?p, a, owl:FunctionalProperty)
            A: (?x, ?p, ?y1), (?x, ?p, ?y2)
            C: (?y1, owl:sameAs, ?y2)
prp-ifp     T: (?p, a, owl:InverseFunctionalProperty)
            A: (?x1, ?p, ?y), (?x2, ?p, ?y)
            C: (?x1, owl:sameAs, ?x2)
cls-maxc2   T: (?x, owl:maxCardinality, 1), (?x, owl:onProperty, ?p)
            A: (?u, a, ?x), (?u, ?p, ?y1), (?u, ?p, ?y2)
            C: (?y1, owl:sameAs, ?y2)
cls-maxqc4  T: (?x, owl:maxQualifiedCardinality, 1), (?x, owl:onProperty, ?p), (?x, owl:onClass, owl:Thing)
            A: (?u, a, ?x), (?u, ?p, ?y1), (?u, ?p, ?y2)
            C: (?y1, owl:sameAs, ?y2)

Table 3
OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]). (T = terminological antecedent; A = assertional antecedent; C = consequent.)

prp-dom    T: (?p, rdfs:domain, ?c)                                          A: (?x, ?p, ?y)    C: (?x, a, ?c)
prp-rng    T: (?p, rdfs:range, ?c)                                           A: (?x, ?p, ?y)    C: (?y, a, ?c)
prp-symp   T: (?p, a, owl:SymmetricProperty)                                 A: (?x, ?p, ?y)    C: (?y, ?p, ?x)
prp-spo1   T: (?p1, rdfs:subPropertyOf, ?p2)                                 A: (?x, ?p1, ?y)   C: (?x, ?p2, ?y)
prp-eqp1   T: (?p1, owl:equivalentProperty, ?p2)                             A: (?x, ?p1, ?y)   C: (?x, ?p2, ?y)
prp-eqp2   T: (?p1, owl:equivalentProperty, ?p2)                             A: (?x, ?p2, ?y)   C: (?x, ?p1, ?y)
prp-inv1   T: (?p1, owl:inverseOf, ?p2)                                      A: (?x, ?p1, ?y)   C: (?y, ?p2, ?x)
prp-inv2   T: (?p1, owl:inverseOf, ?p2)                                      A: (?x, ?p2, ?y)   C: (?y, ?p1, ?x)
cls-int2   T: (?c, owl:intersectionOf, (?c1 ... ?cn))                        A: (?x, a, ?c)     C: (?x, a, ?c1) ... (?x, a, ?cn)
cls-uni    T: (?c, owl:unionOf, (?c1 ... ?ci ... ?cn))                       A: (?x, a, ?ci)    C: (?x, a, ?c)
cls-svf2   T: (?x, owl:someValuesFrom, owl:Thing), (?x, owl:onProperty, ?p)  A: (?u, ?p, ?v)    C: (?u, a, ?x)
cls-hv1    T: (?x, owl:hasValue, ?y), (?x, owl:onProperty, ?p)               A: (?u, a, ?x)     C: (?u, ?p, ?y)
cls-hv2    T: (?x, owl:hasValue, ?y), (?x, owl:onProperty, ?p)               A: (?u, ?p, ?y)    C: (?u, a, ?x)
cax-sco    T: (?c1, rdfs:subClassOf, ?c2)                                    A: (?x, a, ?c1)    C: (?x, a, ?c2)
cax-eqc1   T: (?c1, owl:equivalentClass, ?c2)                                A: (?x, a, ?c1)    C: (?x, a, ?c2)
cax-eqc2   T: (?c1, owl:equivalentClass, ?c2)                                A: (?x, a, ?c2)    C: (?x, a, ?c1)

Table 4
OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold. (T = terminological antecedent; A = assertional antecedent.)

eq-diff1    T: –                                                             A: (?x, owl:sameAs, ?y), (?x, owl:differentFrom, ?y)
prp-irp     T: (?p, a, owl:IrreflexiveProperty)                              A: (?x, ?p, ?x)
prp-asyp    T: (?p, a, owl:AsymmetricProperty)                               A: (?x, ?p, ?y), (?y, ?p, ?x)
prp-pdw     T: (?p1, owl:propertyDisjointWith, ?p2)                          A: (?x, ?p1, ?y), (?x, ?p2, ?y)
prp-adp     T: (?x, a, owl:AllDisjointProperties), (?x, owl:members, (?p1 ... ?pn))
            A: (?u, ?pi, ?y), (?u, ?pj, ?y)  (i ≠ j)
cls-com     T: (?c1, owl:complementOf, ?c2)                                  A: (?x, a, ?c1), (?x, a, ?c2)
cls-maxc1   T: (?x, owl:maxCardinality, 0), (?x, owl:onProperty, ?p)         A: (?u, a, ?x), (?u, ?p, ?y)
cls-maxqc2  T: (?x, owl:maxQualifiedCardinality, 0), (?x, owl:onProperty, ?p), (?x, owl:onClass, owl:Thing)
            A: (?u, a, ?x), (?u, ?p, ?y)
cax-dw      T: (?c1, owl:disjointWith, ?c2)                                  A: (?x, a, ?c1), (?x, a, ?c2)
cax-adc     T: (?x, a, owl:AllDisjointClasses), (?x, owl:members, (?c1 ... ?cn))
            A: (?z, a, ?ci), (?z, a, ?cj)  (i ≠ j)


are sound (i.e., correct) with respect to formal OWL semantics, such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair. Relatedly, OWL contains features that allow for detecting formal contradictions, called inconsistencies, in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization: formally, the intersection of such disjoint classes should be empty and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later in Section 7 to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations:

– Scatter: partition on-disk data using some local split function, and send each chunk to an individual slave machine for subsequent processing;

– Run: request the parallel execution of a task by the slave machines; such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis;

– Gather: gather chunks of output data from the slave swarm and perform some local merge function over the data;

– Flood: broadcast global knowledge required by all slave machines for a future task.


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine are partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel, using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid channelling all intermediary data through one machine.

Note that herein we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications; that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle waiting for the slaves and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel: although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which the task should be performed, and the task-scheduling possible. The distributed implementation of our tasks is designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid, insofar as reasonable, duplicating computation.
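For intuition, a split function of the kind mentioned above (hashing on, e.g., the subject of each quadruple to pick a target slave) can be sketched as follows; the function name and the choice of hash are our own assumptions, not the paper's implementation:

```python
import zlib

def split_by_subject(quads, num_slaves):
    """Partition quadruples into num_slaves chunks by hashing the subject,
    so that all quads sharing a subject land on the same slave (useful for
    a subject-keyed coordinate or merge-join step)."""
    chunks = [[] for _ in range(num_slaves)]
    for (s, p, o, c) in quads:
        chunks[zlib.crc32(s.encode("utf-8")) % num_slaves].append((s, p, o, c))
    return chunks

quads = [("ex:a", "foaf:name", '"Alice"', "http://ex.org/doc1"),
         ("ex:b", "foaf:name", '"Bob"', "http://ex.org/doc2")]
print([len(c) for c in split_by_subject(quads, 4)])
```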

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

9 http://openhms.sourceforge.net/rmiio/

10 We observe, e.g., a maximum FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications, viz.: 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka. PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count in each PLD queue were polled first.

With regards the resulting corpus: of the 1.118 billion quads, 1.106 billion are unique and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now we discuss the usage of terms in a data-level position, viz., terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs and 28.9 million (10%) were literals. With respect to literals, each had, on average, 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had, on average, 5.233 data-level occurrences. Each occurred, on average, 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred, on average, 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and the object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had, on average, 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position; 19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs; 56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) appeared in only one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of


non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident; again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position). [log/log plot; x-axis: number of documents mentioning URI; y-axis: number of URIs.]

Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position). [log/log plot; x-axis: number of PLDs mentioning URI; y-axis: number of URIs.]

– each URI is re-used on average in 4.7 documents, but usually only within the same domain; most external re-use is in the object position of a triple;

– 99% of URIs appear in only one PLD.

We can conclude that within our corpus (itself a general crawl for RDF/XML on the Web) there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Baseline consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource." [7]

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus, provided directly by the Linked Data publishers themselves, that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps each identifier to the equivalence class in which it is contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier; for each equivalence class, we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one


thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;

(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;

(iii) for each equivalence class in the index, choose a canonical term;

(iv) scan the corpus again, canonicalising any term in the subject position, or in the object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus (the owl:sameAs statements) and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;

(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);

(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence classes.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing the term with the lowest alphabetical ordering; a canonical term is thus associated with each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations;

(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index;

(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods, involving the two scans of the main corpus, can be conducted in parallel.

Algorithm 1. Building equivalence map [32]

Require: SAMEAS DATA: SA
 1: map ← ∅
 2: for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
 3:   eqcs ← map.get(s)
 4:   eqco ← map.get(o)
 5:   if eqcs = ∅ ∧ eqco = ∅ then
 6:     eqcs∪o ← {s, o}
 7:     map.put(s, eqcs∪o)
 8:     map.put(o, eqcs∪o)
 9:   else if eqcs = ∅ then
10:     add s to eqco
11:     map.put(s, eqco)
12:   else if eqco = ∅ then
13:     add o to eqcs
14:     map.put(o, eqcs)
15:   else if eqcs ≠ eqco then
16:     if |eqcs| > |eqco| then
17:       add all eqco into eqcs
18:       for eo ∈ eqco do
19:         map.put(eo, eqcs)
20:       end for
21:     else
22:       add all eqcs into eqco
23:       for es ∈ eqcs do
24:         map.put(es, eqco)
25:       end for
26:     end if
27:   end if
28: end for
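A compact, runnable rendering of Algorithm 1 together with the canonical-term choice described above (illustrative Python; the authors' system is Java-based, and the corpus-rewriting scan is a separate step not shown here):

```python
def build_equivalence_map(same_as_pairs):
    """Union-find over owl:sameAs pairs: maps each term to its (shared) equivalence class."""
    eq_map = {}                                   # term -> set of coreferent terms
    for s, o in same_as_pairs:
        if s == o:
            continue
        eqcs, eqco = eq_map.get(s), eq_map.get(o)
        if eqcs is None and eqco is None:
            eqc = {s, o}
            eq_map[s] = eq_map[o] = eqc
        elif eqcs is None:
            eqco.add(s); eq_map[s] = eqco
        elif eqco is None:
            eqcs.add(o); eq_map[o] = eqcs
        elif eqcs is not eqco:
            small, big = (eqcs, eqco) if len(eqcs) < len(eqco) else (eqco, eqcs)
            big |= small                          # merge the smaller class into the larger
            for term in small:
                eq_map[term] = big
    return eq_map

def canonical(eqc):
    """Prefer URIs over blank-nodes (here prefixed '_:'), then lowest lexical order."""
    return min(eqc, key=lambda t: (t.startswith("_:"), t))

pairs = [("ex:a", "ex:b"), ("ex:b", "_:x"), ("ex:c", "ex:d")]
classes = {frozenset(c) for c in build_equivalence_map(pairs).values()}
print({canonical(c): sorted(c) for c in classes})
```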

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%), i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1, 2, 4 and 8) over the

Table 5
Breakdown of timing of distributed baseline consolidation (min / % of total), for 1, 2, 4 and 8 slave machines over the 100 million quadruple corpus, and for 8 slave machines over the full corpus (full-8).

Category                        1 (min/%)      2 (min/%)      4 (min/%)      8 (min/%)      Full-8 (min/%)
Master
  Total execution time          42.2  100.0    22.3  100.0    11.9  100.0     6.7  100.0    63.3  100.0
  Executing                      0.5    1.1     0.5    2.2     0.5    4.0     0.5    7.5     8.5   13.4
    Aggregating owl:sameAs       0.4    1.1     0.5    2.2     0.5    4.0     0.5    7.5     8.4   13.3
    Miscellaneous                0.1    0.1     0.1    0.4     ~      ~       ~      ~       0.1    0.1
  Idle (waiting for slaves)     41.8   98.9    21.8   97.7    11.4   95.8     6.2   92.1    54.8   86.6
Slaves
  Avg. executing (total)        41.8   98.9    21.8   97.7    11.4   95.8     6.2   92.1    53.5   84.6
    Extract owl:sameAs           9.0   21.4     4.5   20.2     2.2   18.4     1.1   16.8    12.3   19.5
    Consolidate                 32.7   77.5    17.1   76.9     8.8   73.5     4.7   70.5    41.2   65.1
  Avg. idle                      0.5    1.1     0.7    2.9     1.0    8.2     0.8   12.7     9.8   15.4
    Waiting for peers            0.0    0.0     0.1    0.7     0.5    4.0     0.3    4.7     1.3    2.0
    Waiting for master           0.5    1.1     0.5    2.3     0.5    4.2     0.5    7.9     8.5   13.4

Fig. 3. Distribution of sizes of equivalence classes on a log/log scale (from [32]). [x-axis: equivalence class size; y-axis: number of classes.]

Fig. 4. Distribution of the number of PLDs per equivalence class on a log/log scale. [x-axis: number of PLDs in equivalence class; y-axis: number of classes.]

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e, which at the time of writing correspond to 36 of the 443 equivalent URIs found for Dieter Fensel.


100 million quadruple corpus. Note that in the table we use the symbol '~' to indicate a negligible value (0 < ~ < 0.05). We see that as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5× would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs), an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum of two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from

more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not they were verified as correct/incorrect. Indeed, the results for classes 1 and 2 were deemed incorrect due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp., Dieter Fensel and Rudi Studer; authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results. For the sampling, we take the closed


Table 6
Largest five equivalence classes (from [32]).

   Canonical term (lexically first in equivalence class)                                  Size   Correct?
1  http://bio2rdf.org/dailymed_drugs:1000                                                 8481   ✗
2  http://bio2rdf.org/dailymed_drugs:1042                                                  800   ✗
3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e             443   ✓
4  http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1                   353   ✓
5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9        316   ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

    PLD              PLD                      Co-occur
1   rdfize.com       uriburner.com            506623
2   dbpedia.org      freebase.com             187054
3   bio2rdf.org      purl.org                 185392
4   loc.gov          info:lc/authorities(a)   166650
5   l3s.de           rkbexplorer.com          125842
6   bibsonomy.org    l3s.de                   125832
7   bibsonomy.org    rkbexplorer.com          125830
8   dbpedia.org      mpii.de                   99827
9   freebase.com     mpii.de                   89610
10  dbpedia.org      umbel.org                 65565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities: in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features, including functional properties, inverse-functional properties and certain cardinality restrictions, that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes, such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc., to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, WIKIPEDIA articles, etc.

Although the Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to the methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in-memory, and instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations and, likewise, on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom15 involving properties with less than five occurrences in the data; hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;

(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;

(iii) apply the general rules over the corpus, and buffer any input/inferred statements relevant for consolidation to a new file;

(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

In Step (i), we extract the terminological data required for application of our rules from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning: in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf

we would generate the T-ground rule

(?x, homepage, ?y) ⇒ (?x, isPrimaryTopicOf, ?y)

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be applied by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns such as

(?x, isPrimaryTopicOf, ?y)

which may indicate data useful for consolidation (note that here, isPrimaryTopicOf is assumed to be inverse-functional).
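A rough sketch of this T-grounding step (illustrative Python; the rule representation is our own assumption, not the authors' implementation): terminological triples are compiled once into single-pattern rules and into predicates flagged as consolidation-relevant, which can then be applied or matched during a single scan of the assertional data.

```python
def t_ground(terminology):
    """Compile terminological triples into (a) single-pattern general rules and
    (b) predicates whose triples should be buffered for later consolidation joins."""
    general_rules = []        # each rule: (predicate, function(triple) -> inferred triple)
    consolidation_preds = set()
    for (s, p, o) in terminology:
        if p == "rdfs:subPropertyOf":                      # prp-spo1
            general_rules.append((s, lambda t, sup=o: (t[0], sup, t[2])))
        elif p == "owl:inverseOf":                         # prp-inv1
            general_rules.append((s, lambda t, inv=o: (t[2], inv, t[0])))
        elif p == "rdf:type" and o == "owl:InverseFunctionalProperty":
            consolidation_preds.add(s)                     # buffer (?x, s, ?y) triples
    return general_rules, consolidation_preds

terminology = [("homepage", "rdfs:subPropertyOf", "isPrimaryTopicOf"),
               ("isPrimaryTopicOf", "rdf:type", "owl:InverseFunctionalProperty")]
rules, preds = t_ground(terminology)
print(rules[0][1](("ex:axel", "homepage", "http://polleres.net/")))
# ('ex:axel', 'isPrimaryTopicOf', 'http://polleres.net/')
print(preds)   # {'isPrimaryTopicOf'}
```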

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern, including any owl:sameAs statements found, are buffered to a file for later join computation.

Subsequently, in Step (iv), we must compute the on-disk, canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18

(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches are written to disk, and the sorted batches are merged into the final output; and

(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files.18 For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data

18 These files are G-Zip-compressed flat files of N-Triple-like syntax, encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.



according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.

Both the originally asserted and the newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method, again leveraging external sorts and merge-joins, to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, and a, b, c, etc., to denote members of U ∪ B such that a < b < c, where URIs are defined to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:

20 One could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);

– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;

– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;

– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;

(iv) the process of reaching a fixpoint is accelerated using available main memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences; when in-memory capacity is reached, we write these batches to disk and proceed with the on-disk computation. This is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external-sorts and merge-joins. First, we prune the owl:sameAs index to maintain only relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20
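The following is a minimal sketch of one such rewrite pass (illustrative Python over in-memory lists standing in for the sorted on-disk files; file handling, GZip streams and the second-position pass are omitted): given quadruples sorted by the position to rewrite, and (non-canonical, canonical) sameAs pairs sorted by the non-canonical term, a single interleaved scan rewrites that position.

```python
def rewrite_position(sorted_quads, sorted_sameas, pos=0):
    """Merge-join: sorted_quads are sorted on position `pos`; sorted_sameas is a
    sorted list of (non_canonical, canonical) pairs. Yields rewritten quads."""
    i = 0
    for quad in sorted_quads:
        key = quad[pos]
        while i < len(sorted_sameas) and sorted_sameas[i][0] < key:
            i += 1                        # advance the equality index to the join key
        if i < len(sorted_sameas) and sorted_sameas[i][0] == key:
            quad = quad[:pos] + (sorted_sameas[i][1],) + quad[pos + 1:]
        yield quad

quads = sorted([("ex:b", "foaf:name", '"Bob"', "ex:doc2"),
                ("ex:x", "foaf:age", '"30"', "ex:doc1")])
sameas = sorted([("ex:x", "ex:a")])       # ex:x rewritten to canonical ex:a
print(list(rewrite_position(quads, sameas, pos=0)))
# [('ex:b', 'foaf:name', '"Bob"', 'ex:doc2'), ('ex:a', 'foaf:age', '"30"', 'ex:doc1')]
```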

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net/> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie/> .
<http://polleres.net/> owl:sameAs <http://axel.deri.ie/> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs closure, we require a higher-level iterative process, as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;

(ii) apply the initial closure of the owl:sameAs data;

(iii) then, iteratively until no new owl:sameAs inferences are found:

– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;

– derive the new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;

– re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans, over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;

(ii) Gather: gather the terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;

(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;

(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;

(v) Gather: gather all consolidation-relevant statements onto the master machine; then, in parallel:


Table 8
Breakdown of timing of distributed extended consolidation with reasoning (min / % of total), where * identifies the master/slave tasks that run concurrently.

Category                                     1 (min/%)       2 (min/%)       4 (min/%)       8 (min/%)       Full-8 (min/%)
Master
  Total execution time                       454.3 100.0     230.2 100.0     127.5 100.0      81.3 100.0     740.4 100.0
  Executing                                   29.9   6.6      30.3  13.1      32.5  25.5      30.1  37.0     243.6  32.9
    Aggregate consolidation-relevant data      4.3   0.9       4.5   2.0       4.7   3.7       4.6   5.7      29.9   4.0
    Closing owl:sameAs*                       24.4   5.4      24.5  10.6      25.4  19.9      24.4  30.0     211.2  28.5
    Miscellaneous                              1.2   0.3       1.3   0.5       2.4   1.9       1.1   1.3       2.5   0.3
  Idle (waiting for slaves)                  424.5  93.4     199.9  86.9      95.0  74.5      51.2  63.0     496.8  67.1
Slaves
  Avg. executing (total)                     448.9  98.8     221.2  96.1     111.6  87.5      56.5  69.4     599.1  80.9
    Extract terminology                       44.4   9.8      22.7   9.9      11.8   9.2       6.9   8.4      49.4   6.7
    Extract consolidation-relevant data       95.5  21.0      48.3  21.0      24.3  19.1      13.2  16.2     137.9  18.6
    Initial sort (by subject)*                98.0  21.6      48.3  21.0      23.7  18.6      11.6  14.2     133.8  18.1
    Consolidation                            211.0  46.4     101.9  44.3      51.9  40.7      24.9  30.6     278.0  37.5
  Avg. idle                                    5.5   1.2       8.9   3.9      37.9  29.7      23.6  29.0     141.3  19.1
    Waiting for peers                          0.0   0.0       3.2   1.4       7.1   5.6       6.3   7.7      37.5   5.1
    Waiting for master                         5.5   1.2       5.8   2.5      30.8  24.1      17.3  21.2     103.8  14.0


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;

– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible; the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time; of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of the owl:sameAs

data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507×, 0.554× and 0.638×, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements, this time directly extracted from explicit statements as opposed to explicit and inferred statements, which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine, and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9
Top ten most frequently occurring blacklisted values.

  #   Blacklisted term                                      Occurrences
  1   empty literals                                        584,735
  2   <http://null>                                         414,088
  3   <http://www.vox.com/gone>                             150,402
  4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"            58,338
  5   <http://www.facebook.com>                             6,988
  6   <http://facebook.com>                                 5,462
  7   <http://www.google.com>                               2,234
  8   <http://www.facebook.com>                             1,254
  9   <http://google.com>                                   1,108
  10  <http://null.com>                                     542

[Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); x-axis: equivalence class size, y-axis: number of classes.]


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties, and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
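To make the filtering step concrete, the following is a minimal Python sketch (not the authors' code) of how a blacklist such as that of Table 9 might be applied while extracting consolidation-relevant statements; the BLACKLIST set and the keep_statement helper are illustrative assumptions.

    BLACKLIST = {
        '""',                                             # empty literals (any language tag)
        "<http://null>",
        '"08445a31a78661b5c746feff39a9db6e4e2cc5cf"',     # SHA-1 hash of "mailto:"
    }

    def keep_statement(subject, prop, value):
        # drop statements whose (inverse-)functional value is blacklisted,
        # before any owl:sameAs conclusions are drawn from them
        return value not in BLACKLIST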

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that, for each statement s1 owl:sameAs s2, we have s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31x from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58x from baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73x from baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014x from baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98x baseline), the largest equivalence class size grow to 33,052 (3.9x baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user (labelled Team Vox) commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford (labelled IBRG), where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time we selected SAME for 823 pairs (82.3%), TRIVIALLY SAME for 145 (14.5%), DIFFERENT for 23 (2.3%) and UNCLEAR for 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences.

Table 10
Largest five equivalence classes after extended consolidation.

  #  Canonical term (lexically lowest in equivalence class)                Size    Correct?
  1  _:bnode37 (http://a12iggymom.vox.com/profile/foaf.rdf)                33,052  yes
  2  http://bio2rdf.org/dailymed_drugs:1000                                8,481   no
  3  http://ajft.org/rdf/foaf.rdf#_me                                      8,140   yes
  4  _:bnode4 (http://174.129.12.140:8080/tcm/data/association/100)        4,395   yes
  5  _:bnode1 (http://aaa977.vox.com/profile/foaf.rdf)                     1,977   yes

[Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); x-axis: number of PLDs in equivalence class, y-axis: number of classes.]

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

  #  Canonical identifier                             BL   R
  1  <http://dblp.l3s.de/Tim_Berners-Lee>             26   50
  2  <genid:danbri>                                   10   138
  3  <http://update.status.net>                       0    0
  4  <http://www.ldodds.com/foaf/foaf-a-matic>        0    0
  5  <http://update.status.net/user/1#acct>           0    6

[Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated, and reasoning consolidated data (log/log); x-axis: number of PLDs mentioning a blank-node/URI term, y-axis: number of terms.]


This is due in part to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data, which "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

25 We make these and later results available for review at http://swse.deri.org/entity

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer respectively to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

26 Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8x the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that:


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; and (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014x) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities (with higher concurrence scores) are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (5.6 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or coreference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities; i.e., a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs; e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers), where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10  foaf:maker  ex:Alice
dblp:AliceB10  foaf:maker  ex:Bob
ex:Alice   foaf:gender  "female"
ex:Alice   foaf:workplaceHomepage  <http://wonderland.com>
ex:Bob     foaf:gender  "male"
ex:Bob     foaf:workplaceHomepage  <http://wonderland.com>
ex:Claire  foaf:gender  "female"
ex:Claire  foaf:workplaceHomepage  <http://wonderland.com>

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted $\mathrm{Card}_G(p,s)$, is the cardinality of the set $\{o \in \mathbf{C} \mid (s,p,o) \in G\}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted $\mathrm{ICard}_G(p,o)$, is the cardinality of the set $\{s \in \mathbf{U} \cup \mathbf{B} \mid (s,p,o) \in G\}$.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting $G_{ex}$ denote our example graph, then, e.g., $\mathrm{Card}_{G_{ex}}$(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph; e.g., $\mathrm{ICard}_{G_{ex}}$(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do of course consider concurrence for subjects with the same literal value for a given predicate).
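As a concrete illustration, the following minimal Python sketch (not the authors' code) computes the two measures over the running example, with triples represented as plain (s, p, o) tuples; the names G_ex, card and icard are ours.

    G_ex = [
        ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
        ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
        ("ex:Alice",  "foaf:gender", '"female"'),
        ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com>"),
        ("ex:Bob",    "foaf:gender", '"male"'),
        ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com>"),
        ("ex:Claire", "foaf:gender", '"female"'),
        ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ]

    def card(graph, p, s):
        # observed cardinality: distinct objects for the (p, s) pair
        return len({o for (s2, p2, o) in graph if (s2, p2) == (s, p)})

    def icard(graph, p, o):
        # observed inverse-cardinality: distinct subjects for the (p, o) pair
        return len({s for (s, p2, o2) in graph if (p2, o2) == (p, o)})

    print(card(G_ex, "foaf:maker", "dblp:AliceB10"))   # 2
    print(icard(G_ex, "foaf:gender", '"female"'))      # 2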

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written $\mathrm{AIC}_G(p)$, is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathrm{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality $\mathrm{AC}_G(p)$ of a property p is defined analogously as for $\mathrm{AIC}_G(p)$. Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is by definition greater-than or equal-to the denominator. Taking an example, $\mathrm{AIC}_{G_{ex}}$(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or as the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe $\mathrm{AIC}_G(p) \approx 1$ a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe $\mathrm{AC}_G(p) \approx 1$ as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high $\mathrm{AIC}_G(p)$, since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower $\mathrm{AIC}_G(p)$, since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naively over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula to weight properties with few observations towards a global mean, as follows:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written $\mathrm{AAIC}_G(p)$, is then

$$\mathrm{AAIC}_G(p) = \frac{\mathrm{AIC}_G(p)\,|G_p| + \overline{\mathrm{AIC}_G}\,\overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1)$$

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathrm{AIC}_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{\mathrm{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$).
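A minimal sketch (in Python, not the authors' implementation) of Eq. (1), assuming the graph is given as an in-memory list of (s, p, o) tuples; the function name aaic is ours.

    from collections import defaultdict

    def aaic(graph):
        subjects = defaultdict(set)                # (p, o) -> distinct subjects
        for s, p, o in graph:
            subjects[(p, o)].add(s)
        per_pred = defaultdict(list)               # p -> list of inverse-cardinalities
        for (p, o), subs in subjects.items():
            per_pred[p].append(len(subs))
        n_pairs = len(subjects)                    # |{(p,o) | exists s . (s,p,o) in G}|
        n_triples = sum(len(subs) for subs in subjects.values())   # |G|
        aic_mean = n_triples / n_pairs             # mean inverse-cardinality over all pairs
        g_mean = n_pairs / len(per_pred)           # mean distinct objects per predicate
        result = {}
        for p, icards in per_pred.items():
            aic_p, g_p = sum(icards) / len(icards), len(icards)
            result[p] = (aic_p * g_p + aic_mean * g_mean) / (g_p + g_mean)
        return result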

Again, the adjusted average cardinality $\mathrm{AAC}_G(p)$ of a property p is defined analogously as for $\mathrm{AAIC}_G(p)$. Some reduction of Eq. (1) is possible if one considers that $\mathrm{AIC}_G(p) \cdot |G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathrm{AIC}_G} \cdot \overline{|G|} = \frac{|G|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate.

[Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison (swperson:stefan-decker and swperson:andreas-harth) are highlighted in the dashed box, and where the labels of inward edges (with respect to the principal entities) are italicised and underlined (from [34]). The example involves shared foaf:made/dc:creator/swrc:author/foaf:maker links to three co-authored papers, shared swrc:affiliation and foaf:member links to the same organisations, and a shared foaf:based_near link to dbpedia:Ireland.]


We maintain Eq. (1) in the given unreduced form as it more clearly corresponds to the structure of a standard credibility formula: the reading ($\mathrm{AIC}_G(p)$) is dampened towards a mean ($\overline{\mathrm{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($\overline{|G|}$).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company, e.g., http://google.com, should not gain as much concurrence as entities that share a value referential to a smaller company, e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as:

$$\mathrm{C}_G(p,s) = \frac{1}{\mathrm{Card}_G(p,s) \cdot \mathrm{AAC}_G(p)}$$

and the concurrence coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as:

$$\mathrm{IC}_G(p,o) = \frac{1}{\mathrm{ICard}_G(p,o) \cdot \mathrm{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0,1], since the denominator is, by definition, necessarily greater-than or equal-to one.
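The coefficients are straightforward to compute once the pair-level and predicate-level statistics are available; a minimal sketch follows (the function names are ours), which is used again in the worked example below.

    def c_coefficient(card_ps, aac_p):
        # Definition 5, predicate-subject direction
        return 1.0 / (card_ps * aac_p)

    def ic_coefficient(icard_po, aaic_p):
        # Definition 5, predicate-object direction
        return 1.0 / (icard_po * aaic_p)

    print(ic_coefficient(2000, 7))   # ~0.00007, as in the example below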

To take an example, let $p_{wh}$ = foaf:workplaceHomepage, and say that we compute $\mathrm{AAIC}_G(p_{wh}) = 7$ from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let $o_g$ = http://google.com, and assume that $o_g$ occurs 2,000 times as a value for $p_{wh}$: $\mathrm{ICard}_G(p_{wh}, o_g) = 2000$; now, $\mathrm{IC}_G(p_{wh}, o_g) = \frac{1}{2000 \times 7} \approx 0.00007$. Also, let $o_d$ = http://deri.ie, such that $\mathrm{ICard}_G(p_{wh}, o_d) = 100$; now, $\mathrm{IC}_G(p_{wh}, o_d) = \frac{1}{100 \times 7} \approx 0.00143$. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let $Z = (z_1, \ldots, z_n)$ be a tuple such that, for each $i = 1, \ldots, n$, $z_i \in\, ]0,1]$. The aggregated concurrence value $\mathrm{ACS}_n$ is computed iteratively: starting with $\mathrm{ACS}_0 = 0$, then for each $k = 1, \ldots, n$, $\mathrm{ACS}_k = z_k + \mathrm{ACS}_{k-1} - z_k \cdot \mathrm{ACS}_{k-1}$.

The computation of the ACS value is the same process as determining the probability of at least one of two independent events occurring, $P(A \lor B) = P(A) + P(B) - P(A) \cdot P(B)$, which is by definition commutative and associative; thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
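A minimal sketch of Definition 6 as a simple fold (our own illustration, not the authors' code); because the probabilistic sum is commutative and associative, the order of the coefficients does not matter.

    def acs(coefficients):
        # aggregated concurrence score over coefficients in ]0, 1]
        score = 0.0
        for z in coefficients:
            score = z + score - z * score
        return score

    print(acs([0.5, 0.1, 0.01]))   # ~0.5545
    print(acs([0.01, 0.5, 0.1]))   # ~0.5545 (same: order-independent)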

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate $p_i$.29 We then aggregate this bag as follows:

$$\mathrm{ACS}(\mathcal{Z}) = \mathrm{ACS}\Big(\big(\mathrm{ACS}(Z_{p_i} \cdot \mathrm{AA[I]C}(p_i))\big)_{Z_{p_i} \in \mathcal{Z}}\Big)$$

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
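The following minimal sketch (our own, in Python) shows the single-pass structure of such a scan over data already sorted by (s, p, o, c); the function name and return shape are illustrative assumptions.

    from itertools import groupby

    def cardinality_stats(sorted_quads):
        # sorted_quads: (s, p, o, c) tuples in natural (s, p, o, c) order.
        # Returns p -> (sum of per-(s,p) cardinalities, number of (s,p) pairs),
        # from which the average cardinality AC_G(p) is sum / count.
        stats = {}
        for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
            distinct_objects = len({o for (_, _, o, _) in group})
            total, count = stats.get(p, (0, 0))
            stats[p] = (total + distinct_objects, count + 1)
        return stats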

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order and, for each (s,p) pair, for the set of unique objects $O_{ps}$ found thereon, and for each pair in

$$\{(o_i, o_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid o_i, o_j \in O_{ps},\ o_i < o_j\}$$

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

$$(o_i,\ o_j,\ \mathrm{C}(p,s),\ p,\ s,\ -)$$

where $\mathrm{C}(p,s) = \frac{1}{|O_{ps}| \cdot \mathrm{AAC}(p)}$. We apply the same process for the other direction: for each (p,o) pair, for the set of unique subjects $S_{po}$, and for each pair in

$$\{(s_i, s_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid s_i, s_j \in S_{po},\ s_i < s_j\}$$

we output analogous sextuples of the form:

$$(s_i,\ s_j,\ \mathrm{IC}(p,o),\ p,\ o,\ +)$$

We call the sets $O_{ps}$ and their analogues $S_{po}$ concurrence classes, denoting sets of entities that share the given predicate–subject / predicate–object pair, respectively. Here, note that the '+' and '-' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.

29 For brevity, we omit the graph subscript.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements; generalising the directional $o_i$/$s_i$ to simply $e_i$, each $(e_i, e_j)$ pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each $(e_i, e_j)$ pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form $(e_i, e_j, s)$, where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples $(e_j, e_i, s)$, which can be sorted separately as required.
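A simplified sketch of this final aggregation step (our own illustration): it folds each pair's coefficients with ACS but, for brevity, omits the same-value pruning and per-predicate damping described in Section 6.1.1.

    from itertools import groupby

    def aggregate_concurrence(sorted_sextuples):
        # sextuple: (e_i, e_j, coefficient, predicate, term, direction), sorted by (e_i, e_j)
        for (ei, ej), group in groupby(sorted_sextuples, key=lambda t: (t[0], t[1])):
            score = 0.0
            for _, _, coeff, _, _, _ in group:
                score = coeff + score - coeff * score   # probabilistic sum (Definition 6)
            yield (ei, ej, score)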

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets. For example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate $\frac{(10^6)^2 - 10^6}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., $|S_{po}|$) is a factor in the denominator of the computed coefficient, such that $\mathrm{IC}(p,o) \leq \frac{1}{|S_{po}|}$. Thus, in practice, we implement a maximum-size threshold for the $S_{po}$ and $O_{ps}$ concurrence classes; this threshold is selected based on a practical upper limit for the raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
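A minimal sketch (ours) of how such a threshold can be chosen from the class-size distribution gathered alongside the predicate statistics, given a budget on raw tuples (10^9 in the evaluation):

    def max_class_size(size_counts, budget=10**9):
        # size_counts: {class size: number of classes of that size}
        total, threshold = 0, 0
        for size in sorted(size_counts):
            pairs = size_counts[size] * size * (size - 1) // 2   # non-reflexive, non-symmetric pairs
            if total + pairs > budget:
                break
            total, threshold = total + pairs, size
        return threshold, total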

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Table 12
Breakdown of timing of distributed concurrence analysis. Columns refer to 1, 2, 4 and 8 slave machines over the 100 million quadruple corpus, and Full-8 to the full corpus over 8 slave machines; times are given in minutes, alongside the percentage of total execution time.

Master:
  Total execution time                 516.2 (100.0%)  262.7 (100.0%)  137.8 (100.0%)  73.0 (100.0%)  835.4 (100.0%)
  Executing                              2.2 (0.4%)      2.3 (0.9%)      3.1 (2.3%)     3.4 (4.6%)      2.1 (0.3%)
    Miscellaneous                        2.2 (0.4%)      2.3 (0.9%)      3.1 (2.3%)     3.4 (4.6%)      2.1 (0.3%)
  Idle (waiting for slaves)            514.0 (99.6%)   260.4 (99.1%)   134.6 (97.7%)   69.6 (95.4%)   833.3 (99.7%)

Slave:
  Avg. executing (total exc. idle)     514.0 (99.6%)   257.8 (98.1%)   131.1 (95.1%)   66.4 (91.0%)   786.6 (94.2%)
    Split/sort/scatter (subject)       119.6 (23.2%)    59.3 (22.6%)    29.9 (21.7%)   14.9 (20.4%)   175.9 (21.1%)
    Merge-sort (subject)                42.1 (8.2%)     20.7 (7.9%)     11.2 (8.1%)     5.7 (7.9%)     62.4 (7.5%)
    Split/sort/scatter (object)        115.4 (22.4%)    56.5 (21.5%)    28.8 (20.9%)   14.3 (19.6%)   164.0 (19.6%)
    Merge-sort (object)                 37.2 (7.2%)     18.7 (7.1%)      9.6 (7.0%)     5.1 (7.0%)     55.6 (6.6%)
    Extract high-level statistics       20.8 (4.0%)     10.1 (3.8%)      5.1 (3.7%)     2.7 (3.7%)     28.4 (3.3%)
    Generate raw concurrence tuples     45.4 (8.8%)     23.3 (8.9%)     11.6 (8.4%)     5.9 (8.0%)     67.7 (8.1%)
    Coordinate/sort concurrence tuples  97.6 (18.9%)    50.1 (19.1%)    25.0 (18.1%)   12.5 (17.2%)   166.0 (19.9%)
    Merge-sort/aggregate similarity     36.0 (7.0%)     19.2 (7.3%)     10.0 (7.2%)     5.3 (7.2%)     66.6 (8.0%)
  Avg. idle                              2.2 (0.4%)      4.9 (1.9%)      6.7 (4.9%)     6.5 (9.0%)     48.8 (5.8%)
    Waiting for peers                    0.0 (0.0%)      2.6 (1.0%)      3.6 (2.6%)     3.2 (4.3%)     46.7 (5.6%)
    Waiting for master                   2.2 (0.4%)      2.3 (0.9%)      3.1 (2.3%)     3.4 (4.6%)      2.1 (0.3%)

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.
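A minimal sketch (our own, with hypothetical helper names) of the split performed by each slave during a coordinate step; statements that hash to the same peer on the join position end up sorted on a single machine.

    def split_by_hash(statements, join_position, num_slaves):
        # partition a slave's local statements by a modulo-hash on the join position;
        # segment i is then sorted and sent to slave i, which merge-sorts what it receives
        segments = [[] for _ in range(num_slaves)]
        for statement in statements:
            segments[hash(statement[join_position]) % num_slaves].append(statement)
        return segments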

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values would lead to an uneven distribution of data, and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% of its time spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static.

[Fig. 9. Breakdown of potential concurrence classes with respect to (a) predicate–object pairs and (b) predicate–subject pairs, where for each class size on the x-axis ($s_i$) we show the number of classes ($c_i$), the raw non-reflexive, non-symmetric concurrence tuples generated ($sp_i = c_i \cdot \frac{s_i^2 - s_i}{2}$), a cumulative count of tuples generated for increasing class sizes ($csp_i = \sum_{j \leq i} sp_j$), and our cut-off ($s_i = 38$) chosen to keep the total number of tuples at around 1 billion (log/log).]

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

  #  Predicate              Objects      Triples      AAC
  1  foaf:nick              150,433,035  150,437,864  1.000
  2  lldpubmed:journal      6,790,426    6,795,285    1.003
  3  rdf:value              2,802,998    2,803,021    1.005
  4  eurostat:geo           2,642,034    2,642,034    1.005
  5  eurostat:time          2,641,532    2,641,532    1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  #  Predicate                   Subjects   Triples    AAIC
  1  lldpubmed:meshHeading       2,121,384  2,121,384  1.009
  2  opiumfield:recommendation   1,920,992  1,920,992  1.010
  3  fb:type.object.key          1,108,187  1,108,187  1.017
  4  foaf:page                   1,702,704  1,712,970  1.017
  5  skipinions:hasFeature       1,010,604  1,010,604  1.019


That is, larger class sizes give quadratically more concurrences, but occur polynomially less often, until the point where the largest classes (which generally only have one occurrence) are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000), we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996); the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs and 303 million through (p,s) pairs).

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64; we analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD.

Table 15
Top five concurrent entities and the number of pairs they share.

  #  Entity label 1  Entity label 2  Concur.
  1  New York City   New York State  791
  2  London          England         894
  3  Tokyo           Japan           900
  4  Toronto         Ontario         418
  5  Philadelphia    Pennsylvania    217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

  #  Ranked entity      Con.  "Closest" entity    Val.
  1  Tim Berners-Lee    908   Lalana Kagal        0.83
  2  Dan Brickley       2552  Libby Miller        0.94
  3  update.status.net  11    socialnetwork.ro    0.45
  4  FOAF-a-matic       21    foaf.me             0.23
  5  Evan Prodromou     3367  Stav Prodromou      0.89


However, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although fewer inter-PLD concurrences are found, they typically have higher concurrences.32

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities; for brevity, we again show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics.

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration; we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes, which incrementally rebuilds coreferences in the set in a manner that preserves consistency, and which is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which are not, in the previous consolidation step we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
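To make the encoding concrete, the following is a minimal sketch (not the authors' implementation; the names Sextuple, to_sextuple and canon are illustrative) of how such a sextuple can be derived from a quadruple given a canonicalisation map:

# Minimal sketch: a consolidated quadruple keeps its canonical subject/object
# together with the original (pre-consolidation) identifiers s' and o'.
from collections import namedtuple

Sextuple = namedtuple("Sextuple", ["s", "p", "o", "c", "s_orig", "o_orig"])

def to_sextuple(quad, canon):
    """quad: (s, p, o, c) with original identifiers; canon: dict mapping an
    identifier to its canonical identifier (absent if not consolidated)."""
    s, p, o, c = quad
    return Sextuple(canon.get(s, s), p, canon.get(o, o), c, s, o)

# e.g., to_sextuple(("ex:B", "foaf:name", "ex:C", "ex:doc"), {"ex:B": "ex:A"})
# gives Sextuple(s='ex:A', p='foaf:name', o='ex:C', c='ex:doc',
#                s_orig='ex:B', o_orig='ex:C')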

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where the first statement constitutes the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dbo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dbo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dbo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dbo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
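The relaxed comparison described above can be sketched as follows (a simplified illustration, not the authors' code, assuming literals are represented as (lexical form, language tag, datatype) tuples):

def incompatible_literals(l1, l2):
    """l1, l2: (lexical_form, language_tag, datatype) tuples bound to a
    functional datatype property for the same (consolidated) subject.
    Returns True only if the bindings should raise an inconsistency."""
    if l1 == l2:
        return False                  # equal literals: no contradiction (standard OWL)
    lex1 = l1[0].strip().lower()      # relaxed check: compare lexical forms only,
    lex2 = l2[0].strip().lower()      # ignoring language tags, datatypes and case
    return lex1 != lex2

# e.g., ("male", None, None)  vs ("Male", "en", None)              -> not flagged
#       ("25-05", None, "xsd:gMonthDay") vs ("25-05", None, None)  -> not flagged
#       ("male", None, None)  vs ("female", None, None)            -> flagged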

With respect to these consistency-checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the statement

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).
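As a rough illustration of this authority condition (a simplification of the authors' authoritative-reasoning criteria; the function name and the deref map are hypothetical):

def authoritative(axiom_terms, source_doc, deref):
    """axiom_terms: URIs appearing in a terminological axiom;
    source_doc: the document the axiom was found in;
    deref: maps a URI to the document it dereferences to (after redirects).
    The axiom is accepted only if some term dereferences to its source."""
    return any(deref.get(term) == source_doc for term in axiom_terms)

# e.g., an axiom over sioc:User and foaf:Person is authoritative only when
# served by the SIOC or FOAF vocabulary document, respectively.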

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type), and where the join variable is "intra-assertional" (appears twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns; a subset of these may require "positive" reasoning to be detected, e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

42 Note that this could also be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.




With respect to the first item, our current implementation performs the repair of an equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one of them; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
:made owl:propertyDisjointWith :maker .

Assertional [fictional]:
ex:AZ′ :maker ex:entcons′ .
dblp:Antoine_Zimmermann′ :made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); and (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:

(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → ℕ × ℝ is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.
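Concretely, this lexicographical order compares the evidence count first and falls back to the concurrence score only on ties:

\[
(d_1, c_1) > (d_2, c_2) \iff d_1 > d_2 \;\vee\; \big(d_1 = d_2 \wedge c_1 > c_2\big).
\]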

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, . . ., Cn}, C ⊆ 2^(U ∪ B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq⁰, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, this is the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can then apply the following clustering (a condensed code sketch is given after the list):

– for each pair of sets Ei, Ej ∈ Eqⁿ such that there is no pair {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights Ω(Ei, Ej) and order the pairings;
– in descending order with respect to the above weights, merge Ei, Ej pairs, such that neither Ei nor Ej has already been merged in this iteration, producing Eqⁿ⁺¹ at the iteration's end;
– iterate over n until fixpoint.
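The following is a condensed sketch (under the simplifying assumptions above; not the authors' implementation) of this greedy consistent clustering, where ids, incompatible and weight are assumed inputs: the identifiers of the equivalence class, the set of incompatible pairs C, and a function returning the (d, c) weight of merging two groups.

def repair_equivalence_class(ids, incompatible, weight):
    """ids: iterable of identifiers; incompatible: set of frozenset pairs;
    weight(a, b): returns the (d, c) pair for merging groups a and b,
    compared lexicographically (d first, concurrence c breaks ties)."""
    clusters = [frozenset([i]) for i in ids]          # seed singleton classes
    while True:
        # candidate merges: cluster pairs with no incompatible pair across them
        candidates = []
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                a, b = clusters[i], clusters[j]
                if any(frozenset((x, y)) in incompatible for x in a for y in b):
                    continue                           # would re-create an inconsistency
                candidates.append((weight(a, b), a, b))
        if not candidates:
            return clusters                            # fixpoint: no consistent merge remains
        candidates.sort(key=lambda t: t[0], reverse=True)
        merged, new_clusters = set(), []
        for _, a, b in candidates:                     # descending (d, c) order
            if a in merged or b in merged:
                continue                               # merge each cluster at most once per round
            merged.update((a, b))
            new_clusters.append(a | b)
        clusters = [c for c in clusters if c not in merged] + new_clusters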

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence classes by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting-set techniques introduced by Reiter [55].

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s and o over the sorted stream of data, and batch consolidated segments of data.
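A rough sketch of this merge-join batching (illustrative only; stream and field names are assumptions, reusing the Sextuple record from earlier):

import heapq

def consolidated_batches(by_subject, by_object):
    """by_subject: sextuples sorted by canonical subject s;
    by_object: sextuples sorted by canonical object o.
    Yields (canonical_id, [sextuples]) batches, one per consolidated entity,
    without indexing the full corpus."""
    keyed = heapq.merge(
        ((st.s, st) for st in by_subject),
        ((st.o, st) for st in by_object),
        key=lambda pair: pair[0])
    current, batch = None, []
    for key, st in keyed:
        if key != current and batch:
            yield current, batch
            batch = []
        current = key
        batch.append(st)
    if batch:
        yield current, batch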

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where for the largest equivalence classes we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and to the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the resulting graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair (min / % of total), for 1, 2, 4 and 8 slave machines, and for the full corpus on 8 machines (Full-8); cells that could not be recovered are marked "–".

Category                                   | 1             | 2            | 4            | 8            | Full-8
Master: Total execution time               | 114.7 / 100.0 | 61.3 / 100.0 | 36.7 / 100.0 | 23.8 / 100.0 | 141.1 / 100.0
Master: Executing (miscellaneous)          | –             | –            | 0.1 / 0.1    | 0.1 / 0.1    | –
Master: Idle (waiting for slaves)          | 114.7 / 100.0 | 61.2 / 100.0 | 36.6 / 99.9  | 23.8 / 99.9  | 141.1 / 100.0
Slave: Avg. executing (total exc. idle)    | 114.7 / 100.0 | 59.0 / 96.3  | 33.6 / 91.5  | 22.3 / 93.7  | 130.9 / 92.8
  Identify inconsistencies and repairs     | 74.7 / 65.1   | 38.5 / 62.8  | 23.2 / 63.4  | 17.2 / 72.2  | 72.4 / 51.3
  Repair corpus                            | 40.0 / 34.8   | 20.5 / 33.5  | 10.3 / 28.1  | 5.1 / 21.5   | 58.5 / 41.5
Slave: Avg. idle                           | –             | 2.3 / 3.7    | 3.1 / 8.5    | 1.5 / 6.3    | 10.2 / 7.2
  Waiting for peers                        | 0.0 / 0.0     | 2.2 / 3.7    | 3.1 / 8.4    | 1.5 / 6.2    | 10.1 / 7.2
  Waiting for master                       | –             | –            | –            | –            | 0.1 / 0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed and diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and thus we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, where only a total of one minute is spent on the slave machine. Most notably, load balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
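These factors correspond to the successive ratios of the total execution times (114.7, 61.3, 36.7 and 23.8 min) reported in Table 17:

\[
\frac{61.3}{114.7} \approx 0.534,\qquad \frac{36.7}{61.3} \approx 0.599,\qquad \frac{23.8}{36.7} \approx 0.649.
\]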

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection.

Manual     | Auto      | Count | %
SAME       | (any)     | 223   | 44.3
DIFFERENT  | (any)     | 261   | 51.9
UNCLEAR    | –         | 19    | 3.8
(any)      | SAME      | 361   | 74.6
(any)      | DIFFERENT | 123   | 25.4
SAME       | SAME      | 205   | 91.9
SAME       | DIFFERENT | 18    | 8.1
DIFFERENT  | DIFFERENT | 105   | 40.2
DIFFERENT  | SAME      | 156   | 59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

#  | Functional property                 | Detections
1  | foaf:gender                         | 60
2  | foaf:age                            | 32
3  | dbo:height, dbo:length              | 4
4  | dbo:wheelbase, dbo:width            | 3
5  | atomowl:body                        | 1
6  | loc:address, loc:name, loc:phone    | 1

Table 19
Breakdown of inconsistency detections for disjoint classes.

#  | Disjoint class 1    | Disjoint class 2 | Detections
1  | foaf:Document       | foaf:Person      | 163
2  | foaf:Document       | foaf:Agent       | 153
3  | foaf:Organization   | foaf:Person      | 17
4  | foaf:Person         | foaf:Project     | 3
5  | foaf:Organization   | foaf:Project     | 1

[Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log scale); x-axis: number of partitions, y-axis: number of repaired equivalence classes.]

[Fig. 11. Distribution of the number of identifiers per repaired partition (log/log scale); x-axis: number of IDs in partition, y-axis: number of partitions.]

44 We were particularly careful to distinguish information resources, such as Wikipedia articles, and non-information resources as being DIFFERENT.


From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint-class pairs in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.
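These averages follow directly from the reported counts:

\[
\frac{905 \text{ partitions}}{280 \text{ classes}} \approx 3.23, \qquad \frac{3985 \text{ identifiers}}{905 \text{ partitions}} \approx 4.4 .
\]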

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually inspected pairs, 361 (72.1%) remained equivalent after the repair, and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
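These figures can be reproduced from the counts in Table 20:

\[
P_{\mathrm{same}} = \frac{205}{361} \approx 0.568,\quad R_{\mathrm{same}} = \frac{205}{223} \approx 0.919,\quad F_1 = \frac{2PR}{P+R} \approx 0.702;
\]
\[
P_{\mathrm{diff}} = \frac{105}{123} \approx 0.854,\quad R_{\mathrm{diff}} = \frac{105}{261} \approx 0.402,\quad F_1 \approx 0.546.
\]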

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent; these can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45 Of course, in OWL (2) DL, datatype properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log n); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank nodes that do not join with, e.g., an owl:unionOf axiom.

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions <...> there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty <thus> it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting co-referent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology, particularly functional properties, inverse-functional properties and disjointness constraints, which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to the one presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string-similarity measures and class-based machine-learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string-matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine co-referent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby co-referent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale/Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges, especially scalability and noise, of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be co-referent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage the use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "dead link") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one co-referent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2), and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

[Table A1: prefix abbreviations ("T-Box" prefixes such as owl, rdf, rdfs, foaf, skos, dbo, dbp, geonames, opiumfield; "A-Box" prefixes such as dbpedia, timbl, dblpperson, swid, yagor) and their corresponding namespace URIs.]

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission, 2008. http://www.w3.org/TeamSubmission/turtle/.
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial, 2008. Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings, 2006.



[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-schema/.
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop, 2008.
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, in: WWW, 2009, pp. 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on The Semantic Web, Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, in: OM, 2010.
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009, 2009.
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft, 2008. http://www.w3.org/TR/owl2-profiles/.
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, in: ER, 2009, pp. 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in linked data, in: International Semantic Web Conference (1), 2010, pp. 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web WWW2010 Workshop (LDOW2010), 2010.
[28] Patrick Hayes, RDF Semantics, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-mt/.
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, in: WWW, 2011, pp. 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT, 2007, pp. 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW, 2010, pp. 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/.
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), 2009.
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop, 2008.
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR abs/1005.4298 (2010).
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC, 2010.
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference, 2011.
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR, 2007, pp. 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009), 2009.
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference, 2009, pp. 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.



Interestingly, OWL also contains other features (on a terminological level) that allow for inferring new owl:sameAs relations. Such features include:

– inverse-functional properties, which act as "key properties" for uniquely identifying subjects, e.g., an isbn datatype-property whose values uniquely identify books and other documents, or the object-property biologicalMotherOf, which uniquely identifies the biological mother of a particular child;

– functional properties, which work in the inverse direction and uniquely identify objects, e.g., the property hasBiologicalMother;

– cardinality and max-cardinality restrictions, which, when given a value of 1, uniquely identify subjects of a given class, e.g., the value for the property spouse uniquely identifies a member of the class Monogamist.

Thus, we see that RDFS and OWL contain various features that allow for inferring and supporting coreference between individuals. In order to support the semantics of such features, we apply inference rules, introduced next.

2.3. Inference rules

Herein we formalise the notion of an inference rule as applicable over RDF graphs [33]. We begin by defining the preliminary concept of a triple pattern, of which rules are composed, and which may contain variables in any position.

Triple pattern, basic graph pattern. A triple pattern is a generalised triple where variables from the set $\mathbf{V}$ are allowed, i.e., $t_v = (s_v, p_v, o_v) \in \mathbf{G}^{\mathbf{V}}$, where $\mathbf{G}^{\mathbf{V}} = (\mathbf{C} \cup \mathbf{V}) \times (\mathbf{C} \cup \mathbf{V}) \times (\mathbf{C} \cup \mathbf{V})$. We call a set (to be read as a conjunction) of triple patterns $G^V \subseteq \mathbf{G}^{\mathbf{V}}$ a basic graph pattern. We denote the set of variables in a graph pattern $G^V$ by $\mathcal{V}(G^V)$.

Intuitively, variables appearing in triple patterns can be bound by any RDF term in $\mathbf{C}$. Such a mapping from variables to constants is called a variable binding.

Variable bindings. Let $\Omega$ be the set of variable bindings $\mathbf{V} \cup \mathbf{C} \to \mathbf{V} \cup \mathbf{C}$ that map every constant $c \in \mathbf{C}$ to itself and every variable $v \in \mathbf{V}$ to an element of the set $\mathbf{C} \cup \mathbf{V}$. A triple $t$ is a binding of a triple pattern $t_v = (s_v, p_v, o_v)$ iff there exists $\mu \in \Omega$ such that $t = \mu(t_v) = (\mu(s_v), \mu(p_v), \mu(o_v))$. A graph $G$ is a binding of a graph pattern $G^V$ iff there exists a mapping $\mu \in \Omega$ such that

$\bigcup_{t_v \in G^V} \mu(t_v) = G$;

we use the shorthand $\mu(G^V) = G$. We use $\Omega(G^V, G) = \{\mu \mid \mu(G^V) \subseteq G,\ \mu(v) = v \text{ if } v \notin \mathcal{V}(G^V)\}$ to denote the set of variable-binding mappings for graph pattern $G^V$ in graph $G$ that map variables outside $G^V$ to themselves.

We can now give the notion of an inference rule, which is comprised of a pair of basic graph patterns that form an "IF ⇒ THEN" logical structure.

Inference rule. We define an inference rule (or often just rule) as a pair $r = (\mathrm{Ante}_r, \mathrm{Con}_r)$, where the antecedent (or body) $\mathrm{Ante}_r \subseteq \mathbf{G}^{\mathbf{V}}$ and the consequent (or head) $\mathrm{Con}_r \subseteq \mathbf{G}^{\mathbf{V}}$ are basic graph patterns such that all variables in the consequent appear in the antecedent: $\mathcal{V}(\mathrm{Con}_r) \subseteq \mathcal{V}(\mathrm{Ante}_r)$. We write inference rules as $\mathrm{Ante}_r \Rightarrow \mathrm{Con}_r$.

Finally, rules allow for applying inference through rule applications, where the rule body is used to generate variable bindings against a given RDF graph, and where those bindings are applied on the head to generate new triples that the graph entails.

Rule application and standard closure. A rule application is the set of immediate consequences

$T_r(G) = \bigcup_{\mu \in \Omega(\mathrm{Ante}_r, G)} \big(\mu(\mathrm{Con}_r) \setminus \mu(\mathrm{Ante}_r)\big)$

of a rule $r$ on a graph $G$; accordingly, for a ruleset $\mathcal{R}$, $T_{\mathcal{R}}(G) = \bigcup_{r \in \mathcal{R}} T_r(G)$. Now let $G_{i+1} = G_i \cup T_{\mathcal{R}}(G_i)$ and $G_0 = G$; the exhaustive application of the $T_{\mathcal{R}}$ operator on a graph $G$ is then the least fixpoint (the smallest value for $n$) such that $G_n = T_{\mathcal{R}}(G_n)$. We call $G_n$ the closure of $G$ w.r.t. ruleset $\mathcal{R}$, denoted as $Cl_{\mathcal{R}}(G)$, or succinctly $Cl(G)$ where the ruleset is obvious.

The above closure takes a graph and a ruleset and recursively applies the rules over the union of the original graph and the inferences, until a fixpoint is reached.
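To make the closure operator concrete, the following is a minimal sketch (not the authors' implementation) of the $T_{\mathcal{R}}$ operator and its least fixpoint over an in-memory set of triples; here a "rule" is abstracted as a function from a graph to its immediate consequences, and the example rule names and prefixes are illustrative only.

import java.util.*;
import java.util.function.Function;

// A minimal sketch of rule application and closure: a rule is abstracted as a
// function G -> T_r(G); the closure applies all rules repeatedly until no new
// triples are inferred (the least fixpoint Cl_R(G)).
final class NaiveClosure {
  static Set<List<String>> closure(Set<List<String>> graph,
                                   List<Function<Set<List<String>>, Set<List<String>>>> rules) {
    Set<List<String>> g = new HashSet<>(graph);
    boolean changed = true;
    while (changed) {
      changed = false;
      for (Function<Set<List<String>>, Set<List<String>>> rule : rules) {
        for (List<String> inferred : rule.apply(g)) {   // immediate consequences T_r(G)
          if (g.add(inferred)) changed = true;          // G_{i+1} = G_i ∪ T_R(G_i)
        }
      }
    }
    return g;                                           // G_n such that G_n = G_n ∪ T_R(G_n)
  }

  public static void main(String[] args) {
    // example rule: eq-sym, (x owl:sameAs y) => (y owl:sameAs x)
    Function<Set<List<String>>, Set<List<String>>> eqSym = data -> {
      Set<List<String>> out = new HashSet<>();
      for (List<String> t : data)
        if (t.get(1).equals("owl:sameAs"))
          out.add(List.of(t.get(2), "owl:sameAs", t.get(0)));
      return out;
    };
    Set<List<String>> input = new HashSet<>(Set.of(List.of("ex:a", "owl:sameAs", "ex:b")));
    System.out.println(closure(input, List.of(eqSym)));  // adds (ex:b owl:sameAs ex:a)
  }
}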

2.4. T-split rules

In [31,33] we formalised an optimisation based on a separation of terminological knowledge from assertional data when applying rules. Similar optimisations have also been used by other authors to enable large-scale rule-based inferencing [74,71,43]. This optimisation is based on the premise that, in Linked Data (and various other scenarios), terminological data speaking about classes and properties is much smaller than assertional data speaking about individuals. Previous observations indicate that, for a Linked Data corpus in the order of a billion triples, such terminological data comprises about 0.1% of the total data volume. Additionally, terminological data is very frequently accessed during reasoning. Hence, separating and storing the terminological data in memory allows for more efficient execution of rules, albeit at the cost of possible incompleteness [74,33].

We now reintroduce some of the concepts relating to separating terminological information [33], which we will use later when applying rule-based reasoning to support the semantics of OWL equality. We begin by formalising some related concepts that help define our notion of terminological information.

Meta-class. We consider a meta-class as a class whose members are themselves (always) classes or properties. Herein, we restrict our notion of meta-classes to the set defined in the RDF(S) and OWL specifications, where examples include rdf:Property, rdfs:Class, owl:DatatypeProperty, owl:FunctionalProperty, etc. Note that, e.g., owl:Thing and rdfs:Resource are not meta-classes, since not all of their members are classes or properties.

Meta-property. A meta-property is one that has a meta-class as its domain; again, we restrict our notion of meta-properties to the set defined in the RDF(S) and OWL specifications, where examples include rdfs:domain, rdfs:range, rdfs:subClassOf, owl:hasKey, owl:inverseOf, owl:oneOf, owl:onProperty, owl:unionOf, etc. Note that rdf:type, owl:sameAs, rdfs:label, e.g., do not have a meta-class as domain, and so are not considered as meta-properties.

Terminological triple. We define the set of terminological triples $\mathcal{T} \subseteq G$ as the union of:

(i) triples with rdf:type as predicate and a meta-class as object;
(ii) triples with a meta-property as predicate;
(iii) triples forming a valid RDF list whose head is the object of a meta-property (e.g., a list used for owl:unionOf, etc.).

The above concepts cover what we mean by terminological data in the context of RDFS and OWL. Next, we define the notion of triple patterns that distinguish between these different categories of data.

Terminological/assertional pattern. We refer to a terminological triple/graph pattern as one whose instance can only be a terminological triple or, resp., a set thereof. An assertional pattern is any pattern that is not terminological.

Given the above notions of terminological data/patterns, we can now define the notion of a T-split inference rule, which distinguishes terminological from assertional information.

T-split inference rule. Given a rule $r = (\mathrm{Ante}_r, \mathrm{Con}_r)$, we define a T-split rule $r_s$ as the triple $(\mathrm{Ante}^{\mathcal{T}}_{r_s}, \mathrm{Ante}^{\mathcal{G}}_{r_s}, \mathrm{Con}_r)$, where $\mathrm{Ante}^{\mathcal{T}}_{r_s}$ is the set of terminological patterns in $\mathrm{Ante}_r$ and $\mathrm{Ante}^{\mathcal{G}}_{r_s} = \mathrm{Ante}_r \setminus \mathrm{Ante}^{\mathcal{T}}_{r_s}$.

The bodies of T-split rules are thus divided into a set of patterns that apply over terminological knowledge and a set of patterns that


apply over assertional (i.e., any) knowledge. Such rules enable an optimisation whereby terminological patterns are pre-bound to generate a new, larger set of purely assertional rules, called T-ground rules. We illustrate this with an example.

Example. Let R_prp-ifp denote the following rule:

(p, a, owl:InverseFunctionalProperty), (x1, p, y), (x2, p, y) ⇒ (x1, owl:sameAs, x2)

When writing T-split rules, we denote terminological patterns in the body by underlining. Also, we use 'a' as a convenient shortcut for rdf:type, which indicates class-membership. Now take the terminological triples:

(isbn, a, owl:InverseFunctionalProperty)
(mbox, a, owl:InverseFunctionalProperty)

From these two triples, we can generate two T-ground rules as follows:

(x1, isbn, y), (x2, isbn, y) ⇒ (x1, owl:sameAs, x2)

(x1, mbox, y), (x2, mbox, y) ⇒ (x1, owl:sameAs, x2)

These T-ground rules encode the terminological OWL knowledge given by the previous two triples, and can be applied directly over the assertional data, e.g., describing specific books with ISBN numbers.
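The following is a minimal sketch (not the SAOR implementation) of this T-grounding step for rule prp-ifp: the terminological data are pre-scanned for inverse-functional property declarations, and the resulting T-ground rules are then applied purely over assertional data; the helper names and the triple encoding are illustrative assumptions.

import java.util.*;

// A sketch of T-grounding for prp-ifp: each declared inverse-functional property p
// yields a purely assertional T-ground rule (x1 p y)(x2 p y) => (x1 owl:sameAs x2),
// represented here simply by p itself.
final class TGroundIFP {
  static Set<String> groundInverseFunctional(Iterable<List<String>> terminology) {
    Set<String> ifps = new HashSet<>();
    for (List<String> t : terminology)
      if (t.get(1).equals("rdf:type") && t.get(2).equals("owl:InverseFunctionalProperty"))
        ifps.add(t.get(0));              // one T-ground rule per inverse-functional property
    return ifps;
  }

  // applying the T-ground rules over assertional data: index by (property, value) and
  // emit owl:sameAs between subjects sharing a value on an inverse-functional property
  static Set<List<String>> apply(Set<String> ifps, Iterable<List<String>> assertional) {
    Map<List<String>, String> firstSubjectSeen = new HashMap<>();
    Set<List<String>> sameAs = new HashSet<>();
    for (List<String> t : assertional) {
      if (!ifps.contains(t.get(1))) continue;
      String prev = firstSubjectSeen.putIfAbsent(List.of(t.get(1), t.get(2)), t.get(0));
      if (prev != null && !prev.equals(t.get(0)))
        sameAs.add(List.of(prev, "owl:sameAs", t.get(0)));
    }
    return sameAs;
  }
}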

2.5. Authoritative T-split rules

Caution must be exercised when applying reasoning over arbitrary data collected from the Web, e.g., Linked Data. In previous works, we have encountered various problems when naïvely performing rule-based reasoning over arbitrary Linked Data [31,33]. In particular, third-party claims made about popular classes and properties must be critically analysed to ensure their trustworthiness. As a single example of the type of claims that can cause issues with reasoning, we found one document that defines nine properties as the domain of rdf:type;6 thereafter, the semantics of rdfs:domain mandate that everything which is a member of any class (i.e., almost every known resource) can be inferred to be a "member" of each of the nine properties. We found that naïve reasoning led to 200% more inferences than would be expected when considering the definitions of classes and properties as defined in their "namespace documents". Thus, in the general case, performing rule-based materialisation with respect to OWL semantics over arbitrary Linked Data requires some critical analysis of the source of data, as per our authoritative reasoning algorithm. Please see [9] for a detailed analysis of the explosion in inferences that occurs when authoritative analysis is not applied.

Such observations prompted us to investigate more robust forms of reasoning. Along these lines, when reasoning over Linked Data we apply an algorithm called authoritative reasoning [14,31,33], which critically examines the source of terminological triples and conservatively rejects those that cannot be definitively trusted, i.e., those that redefine classes and/or properties outside of the namespace document.

6 viz. http://www.eiao.net/rdf/1.0

Dereferencing and authority. We first give the function $\mathrm{http}: \mathbf{U} \to 2^{\mathbf{G}}$ that maps a URI to the RDF graph it returns upon a HTTP lookup that returns the response code 200 Okay; in the case of failure, the function maps to an empty RDF graph. Next, we define the function $\mathrm{redir}: \mathbf{U} \to \mathbf{U}$ that follows a (finite) sequence of HTTP redirects given upon lookup of a URI, or that returns the URI itself in the absence of a redirect or in the case of failure; this function first strips the fragment identifier (given after the '#' character) from the original URI, if present. Then, we give the dereferencing function $\mathrm{deref} := \mathrm{http} \circ \mathrm{redir}$ as the mapping from a URI (a Web location) to an RDF graph it may provide by means of a given HTTP lookup that follows redirects. Finally, we can define the authority function, which maps the URI of a Web document to the set of RDF terms it is authoritative for, as follows:

$\mathrm{auth}: \mathbf{U} \to 2^{\mathbf{C}}$

$u \mapsto \{c \in \mathbf{U} \mid \mathrm{redir}(c) = u\} \cup \mathbf{B}(\mathrm{http}(u))$

where $\mathbf{B}(G)$ denotes the set of blank-nodes appearing in a graph $G$. In other words, a Web document is authoritative for URIs that dereference to it and the blank nodes it contains.
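The following is a minimal sketch (under simplifying assumptions, and not the authors' code) of the redir function and the resulting authority check: the fragment identifier is stripped, HTTP redirects are followed to a fixed depth, and a document is taken to be authoritative for exactly those URIs that redirect to it; the blank-node part of auth, i.e. $\mathbf{B}(\mathrm{http}(u))$, is omitted for brevity.

import java.net.HttpURLConnection;
import java.net.URL;

// A sketch of redir and the URI part of auth from Section 2.5.
final class Authority {
  static String redir(String uri) {
    try {
      String u = uri.contains("#") ? uri.substring(0, uri.indexOf('#')) : uri;
      for (int i = 0; i < 5; i++) {                        // follow a finite redirect chain
        HttpURLConnection c = (HttpURLConnection) new URL(u).openConnection();
        c.setInstanceFollowRedirects(false);
        c.setRequestMethod("HEAD");
        int code = c.getResponseCode();
        String loc = c.getHeaderField("Location");
        c.disconnect();
        if (code < 300 || code >= 400 || loc == null) return u;
        u = new URL(new URL(u), loc).toString();           // resolve relative redirects
      }
      return u;
    } catch (Exception e) {
      return uri;                                          // on failure, return the URI itself
    }
  }

  // a document (at location docUri) is authoritative for term c if looking up c
  // (after stripping '#' and following redirects) leads back to docUri
  static boolean authoritativeFor(String docUri, String term) {
    return redir(term).equals(docUri);
  }
}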

We note that making RDF vocabulary URIs dereferenceable is encouraged in various best practices [45,4]. To enforce authoritative reasoning when applying rules with terminological and assertional patterns in the body, we require that terminological triples are served by a document authoritative for the terms that are bound by specific variable positions in the rule.

Authoritative T-split rule application. Given a T-split rule and a rule application involving a mapping $\mu$ as before, for the rule application to be authoritative there must additionally exist a $\mu(v)$ such that $v \in \mathcal{V}(\mathrm{Ante}^{\mathcal{T}}) \cap \mathcal{V}(\mathrm{Ante}^{\mathcal{G}})$, $\mu(v) \in \mathrm{auth}(u)$ and $\mu(\mathrm{Ante}^{\mathcal{T}}) \subseteq \mathrm{http}(u)$. We call the set of variables that appear in both the terminological and assertional segments of the rule body (i.e., $\mathcal{V}(\mathrm{Ante}^{\mathcal{T}}) \cap \mathcal{V}(\mathrm{Ante}^{\mathcal{G}})$) the set of authoritative variables for the rule. Where a rule does not have terminological or assertional patterns, we consider any rule application as authoritative. The notion of an authoritative T-ground rule follows likewise.

Example. Let R_prp-ifp denote the same rule as used in the previous example, and let the following two triples be given by the dereferenced document of v1:isbn, but not v2:mbox, i.e., a document authoritative for the former term but not the latter:

(v1:isbn, a, owl:InverseFunctionalProperty)
(v2:mbox, a, owl:InverseFunctionalProperty)

The only authoritative variable appearing in both the terminological and assertional patterns of the body of R_prp-ifp is p, bound by v1:isbn and v2:mbox above. Since the document serving the two triples is authoritative for v1:isbn, the first triple is considered authoritative and the following T-ground rule will be generated:

(x1, v1:isbn, y), (x2, v1:isbn, y) ⇒ (x1, owl:sameAs, x2)

However, since the document is not authoritative for v2:mbox, the second T-ground rule will be considered non-authoritative and discarded:

(x1, v2:mbox, y), (x2, v2:mbox, y) ⇒ (x1, owl:sameAs, x2)

Where third-party documents re-define remote terms in a way that affects inferencing over instance data, we thus filter out such non-authoritative definitions.


2.6. OWL 2 RL/RDF rules

Inference rules can be used to (partially) support the semantics of RDFS and OWL. Along these lines, the OWL 2 RL/RDF [24] ruleset is a partial axiomatisation of the OWL 2 RDF-Based Semantics, which is applicable for arbitrary RDF graphs and constitutes an extension of the RDF Semantics [28]; in other words, the ruleset supports a standardised profile of reasoning that partially covers the complete OWL semantics. Interestingly for our scenario, this profile includes inference rules that support the semantics and inference of owl:sameAs, as discussed.

First, in Table 1 we provide the set of OWL 2 RL/RDF rules that support the (positive) semantics of owl:sameAs, axiomatising the symmetry (rule eq-sym) and transitivity (rule eq-trans) of the relation. To take an example, rule eq-sym is as follows:

(x, owl:sameAs, y) ⇒ (y, owl:sameAs, x)

where, if we know that (a, owl:sameAs, b), the rule will infer the symmetric relation (b, owl:sameAs, a). Rule eq-trans operates analogously; both together allow for computing the transitive, symmetric closure of the equivalence relation. Furthermore, Table 1 also contains the OWL 2 RL/RDF rules that support the semantics of replacement (rules eq-rep-*), whereby data that holds for one entity must also hold for equivalent entities.

Note that we (optionally, and in the case of later evaluation) choose not to support:

(i) the reflexive semantics of owl:sameAs, since reflexive owl:sameAs statements will not lead to any consolidation or other non-reflexive equality relations, and will produce a large bulk of materialisations;

(ii) equality on predicates or values for rdf:type, where we do not want possibly imprecise owl:sameAs data to affect terms in these positions.

Given that the semantics of equality is quadratic with respect to the A-Box, we apply a partial-materialisation approach that gives our notion of consolidation: instead of materialising (i.e., explicitly writing down) all possible inferences given by the semantics of replacement, we instead choose one canonical identifier to represent the set of equivalent terms, thus effectively compressing the data. We have used this approach in previous works [30–32], and it has also appeared in related works in the literature [70,6] as a common-sense optimisation for handling data-level equality. To take an example, in later evaluation (cf. Table 10) we find a valid equivalence class (set of equivalent entities) with 33,052 members; materialising all pairwise non-reflexive owl:sameAs statements would infer more than 1 billion owl:sameAs relations (33,052² − 33,052 = 1,092,401,652); further assuming that each entity appeared on average in, e.g., two quadruples, we would infer an additional 2 billion statements of massively duplicated data. By choosing a single canonical identifier, we would instead only materialise 100 thousand statements.
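The arithmetic behind these figures can be checked in a few lines; the two-quadruples-per-entity figure is the average assumed above, and the final count is an approximation of the canonicalised output (rewritten quadruples plus owl:sameAs links back to the original identifiers).

// Worked check of the savings claimed above for an equivalence class of 33,052 members,
// assuming (as above) an average of two quadruples per entity.
public final class MaterialisationCost {
  public static void main(String[] args) {
    long n = 33_052;
    long pairwiseSameAs = n * n - n;            // all non-reflexive owl:sameAs pairs: 1,092,401,652
    long duplicatedData = n * 2 * (n - 1);      // each quadruple rewritten for every alias: ~2.2 billion
    long canonicalised  = n * 2 + (n - 1);      // canonicalised data plus sameAs links: ~100 thousand
    System.out.println(pairwiseSameAs + " " + duplicatedData + " " + canonicalised);
  }
}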

Note that, although we only perform partial materialisation, we do not change the semantics of equality: our methods are sound

Table 1
Rules that support the positive semantics of owl:sameAs.

OWL2RL     Antecedent (assertional)            Consequent
eq-sym     x owl:sameAs y                      y owl:sameAs x
eq-trans   x owl:sameAs y ; y owl:sameAs z     x owl:sameAs z
eq-rep-s   s owl:sameAs s' ; s p o             s' p o
eq-rep-o   o owl:sameAs o' ; s p o             s p o'

with respect to OWL semantics. In addition, alongside the partially-materialised data, we provide a set of consolidated owl:sameAs relations (containing all of the identifiers in each equivalence class) that can be used to "backward-chain" the full inferences possible through replacement (as required). Thus, we do not consider the canonical identifier as somehow 'definitive' or superseding the other identifiers, but merely consider it as representing the equivalence class.7

Finally, herein we do not consider consolidation of literals; one may consider useful applications, e.g., for canonicalising datatype literals, but such discussion is out of the current scope.

As previously mentioned, OWL 2 RL/RDF also contains rules that use terminological knowledge (alongside assertional knowledge) to directly infer owl:sameAs relations. We enumerate these rules in Table 2; note that we italicise the labels of rules supporting features new to OWL 2, and that we list authoritative variables in bold.8

Applying only these OWL 2 RL/RDF rules may miss the inference of some owl:sameAs statements. For example, consider the example RDF data given in Turtle syntax [3] as follows:

# From the FOAF Vocabulary:
foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf .
foaf:isPrimaryTopicOf owl:inverseOf foaf:primaryTopic .
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

# From Example Document A:
exA:axel foaf:homepage <http://polleres.net/> .

# From Example Document B:
<http://polleres.net/> foaf:primaryTopic exB:apolleres .

# Inferred through prp-spo1:
exA:axel foaf:isPrimaryTopicOf <http://polleres.net/> .

# Inferred through prp-inv:
exB:apolleres foaf:isPrimaryTopicOf <http://polleres.net/> .

# Subsequently inferred through prp-ifp:
exA:axel owl:sameAs exB:apolleres .
exB:apolleres owl:sameAs exA:axel .

The example uses properties from the prominent FOAF vocabulary, which is used for publishing personal profiles as RDF. Here we additionally need the OWL 2 RL/RDF rules prp-inv and prp-spo1 (handling standard owl:inverseOf and rdfs:subPropertyOf inferencing, respectively) to infer the owl:sameAs relation entailed by the data.

Along these lines, we support an extended set of OWL 2 RL/RDF rules that contain precisely one assertional pattern, and for which we have demonstrated a scalable implementation, called SAOR, designed for Linked Data reasoning [33]. These rules are listed in Table 3 and, as per the previous example, can generate additional inferences that indirectly lead to the derivation of new owl:sameAs data. We will use these rules for entity consolidation in Section 5.

Lastly, as we discuss later (particularly in Section 7), Linked Data is inherently noisy and prone to mistakes. Although our methods

7 We may optionally consider non-canonical blank-node identifiers as redundant and discard them.

8 As discussed later, our current implementation requires at least one variable to appear in all assertional patterns in the body, and so we do not support prp-key and cls-maxqc3; however, these two features are not commonly used in Linked Data vocabularies [29].


Table 2
OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs.

OWL2RL       Antecedent (terminological)                                                      Antecedent (assertional)   Consequent
prp-fp       p a owl:FunctionalProperty                                                       x p y1 ; x p y2            y1 owl:sameAs y2
prp-ifp      p a owl:InverseFunctionalProperty                                                x1 p y ; x2 p y            x1 owl:sameAs x2
cls-maxc2    x owl:maxCardinality 1 ; x owl:onProperty p                                      u a x ; u p y1 ; u p y2    y1 owl:sameAs y2
cls-maxqc4   x owl:maxQualifiedCardinality 1 ; x owl:onProperty p ; x owl:onClass owl:Thing   u a x ; u p y1 ; u p y2    y1 owl:sameAs y2

Table 3
OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]).

OWL2RL      Antecedent (terminological)                          Antecedent (assertional)   Consequent
prp-dom     p rdfs:domain c                                      x p y                      x a c
prp-rng     p rdfs:range c                                       x p y                      y a c
prp-symp    p a owl:SymmetricProperty                            x p y                      y p x
prp-spo1    p1 rdfs:subPropertyOf p2                             x p1 y                     x p2 y
prp-eqp1    p1 owl:equivalentProperty p2                         x p1 y                     x p2 y
prp-eqp2    p1 owl:equivalentProperty p2                         x p2 y                     x p1 y
prp-inv1    p1 owl:inverseOf p2                                  x p1 y                     y p2 x
prp-inv2    p1 owl:inverseOf p2                                  x p2 y                     y p1 x
cls-int2    c owl:intersectionOf (c1 ... cn)                     x a c                      x a c1 ; ... ; x a cn
cls-uni     c owl:unionOf (c1 ... ci ... cn)                     x a ci                     x a c
cls-svf2    x owl:someValuesFrom owl:Thing ; x owl:onProperty p  u p v                      u a x
cls-hv1     x owl:hasValue y ; x owl:onProperty p                u a x                      u p y
cls-hv2     x owl:hasValue y ; x owl:onProperty p                u p y                      u a x
cax-sco     c1 rdfs:subClassOf c2                                x a c1                     x a c2
cax-eqc1    c1 owl:equivalentClass c2                            x a c1                     x a c2
cax-eqc2    c1 owl:equivalentClass c2                            x a c2                     x a c1

Table 4
OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold.

OWL2RL       Antecedent (terminological)                                                      Antecedent (assertional)
eq-diff1     –                                                                                x owl:sameAs y ; x owl:differentFrom y
prp-irp      p a owl:IrreflexiveProperty                                                      x p x
prp-asyp     p a owl:AsymmetricProperty                                                       x p y ; y p x
prp-pdw      p1 owl:propertyDisjointWith p2                                                   x p1 y ; x p2 y
prp-adp      x a owl:AllDisjointProperties ; x owl:members (p1 ... pn)                        u pi y ; u pj y (i ≠ j)
cls-com      c1 owl:complementOf c2                                                           x a c1 ; x a c2
cls-maxc1    x owl:maxCardinality 0 ; x owl:onProperty p                                      u a x ; u p y
cls-maxqc2   x owl:maxQualifiedCardinality 0 ; x owl:onProperty p ; x owl:onClass owl:Thing   u a x ; u p y
cax-dw       c1 owl:disjointWith c2                                                           x a c1 ; x a c2
cax-adc      x a owl:AllDisjointClasses ; x owl:members (c1 ... cn)                           z a ci ; z a cj (i ≠ j)


are sound (i.e., correct) with respect to formal OWL semantics, such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair. Relatedly, OWL contains features that allow for detecting formal contradictions (called inconsistencies) in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization; formally, the intersection of such disjoint classes should be empty, and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later in Section 7 to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.
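As a concrete illustration of one of these consistency-checking rules, the following is a minimal sketch (not the authors' disambiguation pipeline) of detecting a cax-dw violation: given disjointness pairs taken from (authoritative) terminological data and the class memberships of each entity, a contradiction is flagged when an entity is typed with both classes of a disjoint pair; the data structures are illustrative assumptions.

import java.util.*;

// A sketch of inconsistency detection via rule cax-dw: c1 owl:disjointWith c2,
// x a c1, x a c2 => false.
final class DisjointnessCheck {
  static boolean inconsistent(Set<List<String>> disjointWith,      // pairs (c1, c2)
                              Map<String, Set<String>> typesOf) {  // entity -> asserted classes
    for (Map.Entry<String, Set<String>> e : typesOf.entrySet())
      for (List<String> pair : disjointWith)
        if (e.getValue().contains(pair.get(0)) && e.getValue().contains(pair.get(1)))
          return true;   // e.g., something typed both foaf:Person and foaf:Organization
    return false;
  }
}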

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations:

– Scatter: partition on-disk data using some local split function, and send each chunk to individual slave machines for subsequent processing.

– Run: request the parallel execution of a task by the slave machines; such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis.

– Gather: gather chunks of output data from the slave swarm and perform some local merge function over the data.

– Flood: broadcast global knowledge required by all slave machines for a future task.


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine is partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel, using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all intermediary data through one machine.
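The following is a minimal sketch (not the authors' RMI-based framework) of how these operations might be expressed as interfaces, from the master's and slaves' points of view; the type parameter and method names are illustrative assumptions.

import java.util.List;
import java.util.function.Function;

// A sketch of the distributed operations described above: scatter, run, gather
// and flood on the master side, plus the slave-side coordinate step.
interface Slave<D> {
  void receive(D chunk);                          // accept a chunk of (intermediate) data
  D runTask(String taskName);                     // execute a named task over local data
  void coordinate(Function<D, List<D>> split);    // repartition local data amongst peers
}

interface Master<D> {
  // partition on-disk data with a split function and send one chunk per slave
  void scatter(D data, Function<D, List<D>> split, List<Slave<D>> slaves);
  // request parallel execution of a task on every slave
  List<D> run(String taskName, List<Slave<D>> slaves);
  // collect output chunks from the slave swarm and merge them locally
  D gather(List<Slave<D>> slaves, Function<List<D>, D> merge);
  // broadcast globally-required knowledge (e.g., an owl:sameAs index) to all slaves
  void flood(D globalKnowledge, List<Slave<D>> slaves);
}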

Note that, herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications; that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle, waiting for the slaves, and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel; although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which the task should be performed, and the task-scheduling possible. The distributed implementations of our tasks are designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid (insofar as reasonable) duplicating computation.

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

9 http://openhms.sourceforge.net/rmiio/

10 We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications, viz.: 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka. PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count per each PLD queue were polled first.

With regards to the resulting corpus: of the 1.118 billion quads, 1.106 billion are unique and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now, we discuss the usage of terms in a data-level position, viz., terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred on average 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position; 19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs; 56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of


non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident; again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

[Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position); log/log scale: number of documents mentioning URI vs. number of URIs.]

[Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position); log/log scale: number of PLDs mentioning URI vs. number of URIs.]

– each URI is re-used on average in 4.7 documents, but usually only within the same domain; most external re-use is in the object position of a triple;

– 99% of URIs appear in only one PLD.

We can conclude that, within our corpus (itself a general crawl for RDF/XML on the Web), there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource." [7]

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus, provided directly by the Linked Data publishers themselves, that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps identifiers to the equivalence class in which they are contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier per equivalence class; we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one


thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;
(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;
(iii) for each equivalence class in the index, choose a canonical term;
(iv) scan the corpus again, canonicalising any term in the subject position or object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus (the owl:sameAs statements) and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;
(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);
(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence classes.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing a term with the lowest alphabetical ordering; a canonical term is thus associated to each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations.

(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index.

(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods, involving the two scans of the main corpus, can be conducted in parallel.

Algorithm 1
Building the equivalence map [32].

Require: SAMEAS DATA: SA
 1: map ← {}
 2: for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
 3:   eqcs ← map.get(s)
 4:   eqco ← map.get(o)
 5:   if eqcs = ∅ ∧ eqco = ∅ then
 6:     eqcs∪o ← {s, o}
 7:     map.put(s, eqcs∪o)
 8:     map.put(o, eqcs∪o)
 9:   else if eqcs = ∅ then
10:     add s to eqco
11:     map.put(s, eqco)
12:   else if eqco = ∅ then
13:     add o to eqcs
14:     map.put(o, eqcs)
15:   else if eqcs ≠ eqco then
16:     if |eqcs| > |eqco| then
17:       add all eqco into eqcs
18:       for eo ∈ eqco do
19:         map.put(eo, eqcs)
20:       end for
21:     else
22:       add all eqcs into eqco
23:       for es ∈ eqcs do
24:         map.put(es, eqco)
25:       end for
26:     end if
27:   end if
28: end for

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of the time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%), i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1/2/4/8) over the 100 million quadruple corpus.

Table 5
Breakdown of timing of distributed baseline consolidation: execution times in minutes (and % of total execution time) for 1, 2, 4 and 8 slave machines over the 100 million quadruple corpus, and for 8 slave machines over the full corpus (Full-8).

Category                        1              2              4              8              Full-8
Master
  Total execution time          42.2 (100.0)   22.3 (100.0)   11.9 (100.0)   6.7 (100.0)    63.3 (100.0)
  Executing                     0.5 (1.1)      0.5 (2.2)      0.5 (4.0)      0.5 (7.5)      8.5 (13.4)
    Aggregating owl:sameAs      0.4 (1.1)      0.5 (2.2)      0.5 (4.0)      0.5 (7.5)      8.4 (13.3)
    Miscellaneous               0.1 (0.1)      0.1 (0.4)      ~ (~)          ~ (~)          0.1 (0.1)
  Idle (waiting for slaves)     41.8 (98.9)    21.8 (97.7)    11.4 (95.8)    6.2 (92.1)     54.8 (86.6)
Slaves
  Avg. executing (total)        41.8 (98.9)    21.8 (97.7)    11.4 (95.8)    6.2 (92.1)     53.5 (84.6)
    Extract owl:sameAs          9.0 (21.4)     4.5 (20.2)     2.2 (18.4)     1.1 (16.8)     12.3 (19.5)
    Consolidate                 32.7 (77.5)    17.1 (76.9)    8.8 (73.5)     4.7 (70.5)     41.2 (65.1)
  Avg. idle                     0.5 (1.1)      0.7 (2.9)      1.0 (8.2)      0.8 (12.7)     9.8 (15.4)
    Waiting for peers           0.0 (0.0)      0.1 (0.7)      0.5 (4.0)      0.3 (4.7)      1.3 (2.0)
    Waiting for master          0.5 (1.1)      0.5 (2.3)      0.5 (4.2)      0.5 (7.9)      8.5 (13.4)

[Fig. 3. Distribution of sizes of equivalence classes on a log/log scale (from [32]); x-axis: equivalence class size, y-axis: number of classes.]

[Fig. 4. Distribution of the number of PLDs per equivalence class on a log/log scale; x-axis: number of PLDs in equivalence class, y-axis: number of classes.]

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e, which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


Note that in the table we use the symbol ~ to indicate a negligible value (0 < ~ < 0.05). We see that, as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5 would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs) – an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp., Dieter Fensel and Rudi Studer – authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBpedia and LastFM. With respect to the DBpedia URIs, these (i) were equivalent but for capitalisation variations or stop-words, (ii) were variations of abbreviations or valid synonyms, (iii) were different language versions (e.g., dbpedia:Etats_Unis), (iv) were nicknames (e.g., dbpedia:Yankee_land), (v) were related but not equivalent (e.g., dbpedia:American_Civilization), (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results. For the sampling, we take the closed

Table 6
Largest five equivalence classes (from [32]).

#  Canonical term (lexically first in equivalence class)                              Size   Correct
1  http://bio2rdf.org/dailymed_drugs:1000                                             8481   ✗
2  http://bio2rdf.org/dailymed_drugs:1042                                             800    ✗
3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e        443    ✓
4  http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1              353    ✓
5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9   316    ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that, where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%) and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7.0% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly, from 4.691 to 4.719 (a 0.6% increase), due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

#   PLD             PLD                      Co-occur
1   rdfize.com      uriburner.com            506623
2   dbpedia.org     freebase.com             187054
3   bio2rdf.org     purl.org                 185392
4   loc.gov         info:lc/authorities a    166650
5   l3s.de          rkbexplorer.com          125842
6   bibsonomy.org   l3s.de                   125832
7   bibsonomy.org   rkbexplorer.com          125830
8   dbpedia.org     mpii.de                  99827
9   freebase.com    mpii.de                  89610
10  dbpedia.org     umbel.org                65565

a In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities: in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features – including functional properties, inverse-functional properties and certain cardinality restrictions – that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes – such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc. – to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, Wikipedia articles, etc.

Although Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive/symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that, since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom15 involving properties with less than five occurrences in the data – hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty, rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty, and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals – we also support an additional rule that gives an exact cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.
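To clarify what the consolidation rules compute, the following is a simplified in-memory Python sketch: prp-ifp infers owl:sameAs between subjects that share a value for an inverse-functional property, and prp-fp does the analogue for objects that share a subject for a functional property (the cardinality rule behaves like the latter). The sets ifps and fps are assumed to hold property identifiers gathered from the (authoritative) terminology; in practice, literal values and blacklisted values (Section 5.4) would be excluded, and the join would be computed on disk rather than in memory.

from collections import defaultdict
from itertools import combinations

def infer_same_as(triples, ifps, fps):
    by_po, by_ps = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        if p in ifps:
            by_po[(p, o)].add(s)   # join on a shared predicate-object pair
        if p in fps:
            by_ps[(p, s)].add(o)   # join on a shared predicate-subject pair
    for group in list(by_po.values()) + list(by_ps.values()):
        for x, y in combinations(sorted(group), 2):
            yield (x, "owl:sameAs", y)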

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;

(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;

(iii) apply general rules over the corpus, and buffer any input/inferred statements relevant for consolidation to a new file;

(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

In Step (i), we extract terminological data – required for application of our rules – from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate a T-ground rule

(?x, homepage, ?y) => (?x, isPrimaryTopicOf, ?y)

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be applied by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

(?x, isPrimaryTopicOf, ?y)

which may indicate data useful for consolidation (note that here isPrimaryTopicOf is assumed to be inverse-functional).
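As a small illustration of T-grounding, the sketch below binds rdfs:subPropertyOf axioms from the terminology into single-pattern assertional rules (the prp-spo1 case only); the function name and rule representation are hypothetical, chosen purely for this example.

def t_ground_subproperty(terminology):
    # yield functions mapping one input triple to its inferences
    for s, p, o in terminology:
        if p == "rdfs:subPropertyOf":
            sub, sup = s, o
            # T-ground rule: (?x, sub, ?y) => (?x, sup, ?y)
            yield lambda t, sub=sub, sup=sup: \
                [(t[0], sup, t[2])] if t[1] == sub else []

rules = list(t_ground_subproperty(
    [("homepage", "rdfs:subPropertyOf", "isPrimaryTopicOf")]))
print(rules[0](("ex:Alice", "homepage", "<http://wonderland.com/>")))
# -> [('ex:Alice', 'isPrimaryTopicOf', '<http://wonderland.com/>')]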

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern – including any owl:sameAs statements found – are buffered to a file for later join computation.

Subsequently, in Step (iv), we must compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18

(ii) external sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and

(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].
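The following is a toy Python sketch of two of these primitives: an external sort that spills sorted batches of newline-terminated records to temporary files and lazily merges them, and a merge-join that aligns two inputs already sorted on the join position. The record format and batch size are illustrative assumptions, not the authors' actual encoding.

import heapq, os, tempfile
from itertools import groupby

def external_sort(records, batch_size=100_000):
    runs, batch = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            runs.append(_spill(batch)); batch = []
    if batch:
        runs.append(_spill(batch))
    return heapq.merge(*(open(path) for path in runs))

def _spill(batch):
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as f:
        f.writelines(sorted(batch))   # each record carries its own newline
    return path

def merge_join(left, right, key):
    # interleaved scan of two sorted inputs, aligning groups on `key`
    lgroups, rgroups = groupby(left, key), groupby(right, key)
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    while lg is not None and rg is not None:
        if lk == rk:
            yield lk, list(lg), list(rg)   # in-memory join for one key
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))
        elif lk < rk:
            lk, lg = next(lgroups, (None, None))
        else:
            rk, rg = next(rgroups, (None, None))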

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data

18 These files are GZip-compressed flat files of N-Triple-like syntax, encoding arbitrary length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.


according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method, again leveraging external sorts and merge-joins to perform the computation (herein we sketch, and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:

20 One could instead build an on-disk map for equivalence classes and pivot elements, and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);

– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;

– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;

– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;

(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences; when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation. This is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.

Next, we briefly describe the process of canonicalising the data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
<http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs closure, we require a higher-level iterative process as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;

(ii) apply the initial closure of the owl:sameAs data;

(iii) then, iteratively until no new owl:sameAs inferences are found:

– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;

– derive new owl:sameAs inferences possible through the previous rewriting, for each consolidation rule;

– re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge.

(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules.

(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines.

(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data.

(v) Gather: gather all consolidation-relevant statements onto the master machine; then, in parallel:


Table 8
Breakdown of timing of distributed extended consolidation with reasoning: execution times in minutes (and % of total execution time) for 1, 2, 4 and 8 slave machines over the 100 million quadruple corpus, and for 8 slave machines over the full corpus (Full-8); * identifies the master/slave tasks that run concurrently.

Category                                   1               2               4               8               Full-8
Master
  Total execution time                     454.3 (100.0)   230.2 (100.0)   127.5 (100.0)   81.3 (100.0)    740.4 (100.0)
  Executing                                29.9 (6.6)      30.3 (13.1)     32.5 (25.5)     30.1 (37.0)     243.6 (32.9)
    Aggregate consolidation-relevant data  4.3 (0.9)       4.5 (2.0)       4.7 (3.7)       4.6 (5.7)       29.9 (4.0)
    Closing owl:sameAs*                    24.4 (5.4)      24.5 (10.6)     25.4 (19.9)     24.4 (30.0)     211.2 (28.5)
    Miscellaneous                          1.2 (0.3)       1.3 (0.5)       2.4 (1.9)       1.1 (1.3)       2.5 (0.3)
  Idle (waiting for slaves)                424.5 (93.4)    199.9 (86.9)    95.0 (74.5)     51.2 (63.0)     496.8 (67.1)
Slaves
  Avg. executing (total)                   448.9 (98.8)    221.2 (96.1)    111.6 (87.5)    56.5 (69.4)     599.1 (80.9)
    Extract terminology                    44.4 (9.8)      22.7 (9.9)      11.8 (9.2)      6.9 (8.4)       49.4 (6.7)
    Extract consolidation-relevant data    95.5 (21.0)     48.3 (21.0)     24.3 (19.1)     13.2 (16.2)     137.9 (18.6)
    Initial sort (by subject)*             98.0 (21.6)     48.3 (21.0)     23.7 (18.6)     11.6 (14.2)     133.8 (18.1)
    Consolidation                          211.0 (46.4)    101.9 (44.3)    51.9 (40.7)     24.9 (30.6)     278.0 (37.5)
  Avg. idle                                5.5 (1.2)       8.9 (3.9)       37.9 (29.7)     23.6 (29.0)     141.3 (19.1)
    Waiting for peers                      0.0 (0.0)       3.2 (1.4)       7.1 (5.6)       6.3 (7.7)       37.5 (5.1)
    Waiting for master                     5.5 (1.2)       5.8 (2.5)       30.8 (24.1)     17.3 (21.2)     103.8 (14.0)


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine.

– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c).

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed.

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible – the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time – of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup – in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment, we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of the owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards to the performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 2, 2 to 4, and 4 to 8) are 0.507×, 0.554× and 0.638×, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by (i) the extraction of consolidation-relevant statements – this time directly extracted from explicit statements, as opposed to explicit and inferred statements – which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9
Top ten most frequently occurring blacklisted values.

#   Blacklisted term                               Occurrences
1   Empty literals                                 584735
2   <http://null>                                  414088
3   <http://www.vox.com/gone/>                     150402
4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"     58338
5   <http://www.facebook.com>                      6988
6   <http://facebook.com>                          5462
7   <http://www.google.com>                        2234
8   <http://www.facebook.com/>                     1254
9   <http://google.com>                            1108
10  <http://null.com>                              542

[Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); x-axis: equivalence class size, y-axis: number of classes.]


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties, and 109 cardinality restrictions with a value of one. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties, and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted – however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), of which only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters, and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9 – this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
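A minimal sketch of how such a blacklist is applied is shown below: statements whose consolidation-relevant value appears on the blacklist are simply dropped before any owl:sameAs inference. The entries mirror the kind of values in Table 9 (the full list used for the evaluation contains forty-one values); the literal encoding and the value_pos parameter are illustrative assumptions.

BLACKLIST = {
    '""',                                             # empty literals
    '<http://null>',
    '"08445a31a78661b5c746feff39a9db6e4e2cc5cf"',     # SHA-1 of 'mailto:'
    '<http://www.facebook.com>',
}

def filter_blacklisted(statements, value_pos=2):
    # drop statements whose (inverse-)functional value is blacklisted
    for stmt in statements:
        if stmt[value_pos] not in BLACKLIST:
            yield stmt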

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22 cf. http://bio2rdf.org/ns/bio2rdf:Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2, and s2 is a canonical identifier) contained 12.03 million statements.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline – 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 2173× from baseline – 54.6% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline – 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user – labelled Team Vox – commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford – labelled IBRG – where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME 823 times (82.3%), TRIVIALLY SAME 145 times (14.5%), DIFFERENT 23 times (2.3%) and UNCLEAR 9 times (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

#  Canonical term (lexically lowest in equivalence class)            Size    Correct
1  bnode37@http://a12iggymom.vox.com/profile/foaf.rdf                33052   ✓
2  http://bio2rdf.org/dailymed_drugs:1000                            8481    ✗
3  http://ajft.org/rdf/foaf.rdf#_me                                  8140    ✓
4  bnode4@http://174.129.12.140:8080/tcm/data/association/100        4395    ✓
5  bnode1@http://aaa977.vox.com/profile/foaf.rdf                     1977    ✓

[Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); x-axis: number of PLDs in equivalence class, y-axis: number of classes.]

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

#  Canonical identifier                            BL    R
1  <http://dblp.l3s.de/Tim_Berners-Lee>            26    50
2  <genid:danbri>                                  10    138
3  <http://update.status.net/>                     0     0
4  <http://www.ldodds.com/foaf/foaf-a-matic>       0     0
5  <http://update.status.net/user/1#acct>          0     6

[Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated and reasoning consolidated data (log/log); x-axis: number of PLDs mentioning a blank-node/URI term, y-axis: number of terms.]


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers

25 We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley" – one of the founders and leading contributors of the FOAF project – with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs – one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity – in terms of unique PLDs – was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities – with higher concurrence scores – are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or coreference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities – if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared-pairs – pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities – a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs – e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers) – where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between the three colleagues ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p wrt. s in G, denoted Card_G(p, s), is the cardinality of the set {o ∈ C | (s, p, o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p wrt. o in G, denoted ICard_G(p, o), is the cardinality of the set {s ∈ U ∪ B | (s, p, o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_{G_ex}(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph – e.g., ICard_{G_ex}(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores – note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality) – we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.
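The two definitions above can be read directly as code; the following Python sketch computes them over an abridged version of the running example (G_ex), where terms are written as plain prefixed strings purely for illustration.

def card(G, p, s):
    # Definition 1: number of distinct objects for a predicate-subject pair
    return len({o for (s2, p2, o) in G if (s2, p2) == (s, p)})

def icard(G, p, o):
    # Definition 2: number of distinct subjects for a predicate-object pair
    return len({s2 for (s2, p2, o2) in G if (p2, o2) == (p, o)})

G_ex = {
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", '"female"'),
    ("ex:Bob", "foaf:gender", '"male"'),
    ("ex:Claire", "foaf:gender", '"female"'),
}

print(card(G_ex, "foaf:maker", "dblp:AliceB10"))   # 2
print(icard(G_ex, "foaf:gender", '"female"'))      # 2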

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

\[
\mathit{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}
\]

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is, by definition, greater-than or equal-to the denominator. Taking an example, AIC_{G_ex}(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and, analogously, properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties – in their respective directions – as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean – consider a fictional property ex:maritalStatus, for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula as follows to weight properties with few observations towards a global mean:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

\[
\mathit{AAIC}_G(p) = \frac{\mathit{AIC}_G(p) \cdot |G_p| + \overline{\mathit{AIC}_G} \cdot |\overline{G}|}{|G_p| + |\overline{G}|} \qquad (1)
\]

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathit{AIC}_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{\mathit{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $|\overline{G}|$ is the average number of distinct objects for all predicates in the graph (formally, $|\overline{G}| = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that $\mathit{AIC}_G(p) \cdot |G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathit{AIC}_G} \cdot |\overline{G}| = \frac{|G|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate.

[Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward edges (with respect to the principal entities) are italicised and underlined (from [34]). The depicted nodes include sw:person:stefan-decker, sw:person:andreas-harth, three sw:paper nodes, sw:org:nui-galway, sw:org:deri-nui-galway and dbpedia:Ireland, connected by foaf:made, dc:creator, swrc:author, foaf:maker, foaf:member, swrc:affiliation and foaf:based_near edges.]


We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean ($\overline{\mathit{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($|\overline{G}|$).

Now we move towards combining these metrics to determine

the concurrence of entities who share a given non-empty set ofpropertyndashvalue pairs To do so we combine the adjusted average(inverse) cardinality values which apply generically to propertiesand the (inverse) cardinality values which apply to a given prop-ertyndashvalue pair For example take the property foafworkplace-Homepage entities that share a value referential to a largecompanymdasheg httpgooglecommdashshould not gain as muchconcurrence as entities that share a value referential to a smallercompanymdasheg httpderiie Conversely consider a fictionalproperty excitizenOf which relates a citizen to its country andfor which we find many observations in our corpus returning ahigh AAIC value and consider that only two entities share the va-lue exVanuatu for this property given that our data are incom-plete we can use the high AAIC value of excitizenOf todetermine that the property is usually not exclusive and that itis generally not a good indicator of concurrence28

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as:

\[ C_G(p,s) = \frac{1}{\mathit{Card}_G(p,s)\cdot \mathit{AAC}_G(p)} \]

and the concurrence-coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as:

\[ \mathit{IC}_G(p,o) = \frac{1}{\mathit{ICard}_G(p,o)\cdot \mathit{AAIC}_G(p)} \]

Again, please note that these coefficients fall into the interval ]0,1], since the denominator, by definition, is necessarily greater than or equal to one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com/ and assume that o_g occurs 2,000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2000; now IC_G(p_wh, o_g) = 1/(2000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie/, such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.
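For illustration, a minimal sketch (in Python; our own, not from the original system) of how such a coefficient could be computed, reproducing the Google/DERI figures of the worked example under the assumed AAIC value of 7:

```python
def concurrence_coefficient(icard: int, aaic: float) -> float:
    """IC_G(p,o) = 1 / (ICard_G(p,o) * AAIC_G(p)); falls in ]0,1]."""
    return 1.0 / (icard * aaic)

aaic_wh = 7.0                                   # assumed AAIC for foaf:workplaceHomepage
print(concurrence_coefficient(2000, aaic_wh))   # ~0.00007 (shared http://google.com/)
print(concurrence_coefficient(100, aaic_wh))    # ~0.00143 (shared http://deri.ie/)
```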

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that, for each i = 1,...,n, z_i ∈ ]0,1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1,...,n, ACS_k = z_k + ACS_{k-1} − z_k·ACS_{k-1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring, P(A ∨ B) = P(A) + P(B) − P(A)·P(B), which is by definition commutative and associative; thus, the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
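A minimal sketch of this aggregation (Python; ours, not the authors' code); the probabilistic-OR fold makes the order-independence evident:

```python
from functools import reduce

def acs(coefficients):
    """Aggregate concurrence coefficients in ]0,1] via a probabilistic OR."""
    return reduce(lambda acc, z: z + acc - z * acc, coefficients, 0.0)

print(acs([0.5, 0.1, 0.2]))   # 0.64
print(acs([0.2, 0.5, 0.1]))   # 0.64 -- commutative and associative
```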

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same values for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation, for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs: instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate p_i.29 We then aggregate this bag as follows:

\[ \mathit{ACS}(\mathcal{Z}) = \mathit{ACS}\big(\big\{\, \mathit{ACS}\big(Z_{p_i}\cdot \mathit{AA[I]C}(p_i)\big) \;\big|\; Z_{p_i}\in\mathcal{Z} \,\big\}\big) \]

where AA[I]C is either AAC or AAIC, depending on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

29 For brevity, we omit the graph subscript.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s,p,o,c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o,p,s,c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
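As an illustration of this scan, a simplified single-machine sketch (Python; hypothetical helper names, not the authors' code) that derives credibility-adjusted average inverse-cardinalities from quads sorted by (o,p,s,c):

```python
from itertools import groupby
from collections import defaultdict

def adjusted_avg_inverse_cardinalities(quads_sorted_by_opsc):
    """One pass over (s,p,o,c) quads sorted by (o,p,s,c): count distinct subjects
    per (p,o) pair, average per predicate, then dampen towards the global mean
    following the credibility formula of Eq. (1)."""
    ic_sum = defaultdict(int)    # sum of inverse-cardinalities per predicate
    pairs = defaultdict(int)     # number of distinct (p,o) pairs per predicate
    total_pairs, total_triples = 0, 0

    for (o, p), group in groupby(quads_sorted_by_opsc, key=lambda q: (q[2], q[1])):
        subjects = {q[0] for q in group}       # distinct subjects sharing (p,o)
        ic_sum[p] += len(subjects)
        pairs[p] += 1
        total_pairs += 1
        total_triples += len(subjects)

    aic_global = total_triples / total_pairs   # mean IC over all (p,o) pairs
    avg_pairs = total_pairs / len(pairs)       # mean number of distinct objects per predicate
    aaic = {}
    for p in pairs:
        aic_p = ic_sum[p] / pairs[p]
        aaic[p] = (aic_p * pairs[p] + aic_global * avg_pairs) / (pairs[p] + avg_pairs)
    return aaic
```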

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s,p) pair, for each set of unique objects O_ps found thereon, and for each pair in

\[ \{(o_i,o_j) \in (U\cup B)\times(U\cup B) \mid o_i,o_j \in O_{ps},\ o_i < o_j\} \]

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p,s), p, s, −)

where $C(p,s) = \frac{1}{|O_{ps}|\cdot \mathit{AAC}(p)}$. We apply the same process for the other

direction: for each (p,o) pair, for each set of unique subjects S_po, and for each pair in

\[ \{(s_i,s_j) \in (U\cup B)\times(U\cup B) \mid s_i,s_j \in S_{po},\ s_i < s_j\} \]

we output analogous sextuples of the form:

(s_i, s_j, IC(p,o), p, o, +)

We call the sets O_ps and their analogues S_po concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements. Generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
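A compact sketch (Python; our simplification, which applies the plain ACS fold and ignores the per-predicate AA[I]C bounding described above) of this merge-join aggregation over the sorted raw tuples:

```python
from itertools import groupby

def aggregate_concurrence(raw_tuples_sorted):
    """raw_tuples_sorted: (e_i, e_j, coefficient, p, term, dir) sorted by (e_i, e_j).
    Yields (e_i, e_j, score), combining coefficients with a probabilistic OR."""
    for (ei, ej), group in groupby(raw_tuples_sorted, key=lambda t: (t[0], t[1])):
        score = 0.0
        for _, _, coef, _, _, _ in group:
            score = coef + score - coef * score
        yield (ei, ej, score)
```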

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets: for example, consider a corpus containing one million persons sharing the value female for the property foaf:gender, where we would have to generate $\frac{(10^6)^2 - 10^6}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., |S_po|) is a factor in the denominator of the computed C(p,o) score, such that C(p,o) ≤ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around one billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
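A rough sketch (Python; simplified, single machine, with hypothetical names) of how the raw tuples for the predicate–object direction could be emitted while enforcing the class-size threshold; aaic is the dictionary of adjusted statistics computed earlier:

```python
from itertools import combinations, groupby

MAX_CLASS_SIZE = 38   # practical cut-off used in the evaluation

def emit_po_concurrence_tuples(quads_sorted_by_opsc, aaic):
    """For each (p,o) pair, emit one tuple per pair of distinct subjects sharing it,
    weighted by IC(p,o) = 1 / (|S_po| * AAIC(p)); over-large classes are skipped."""
    for (o, p), group in groupby(quads_sorted_by_opsc, key=lambda q: (q[2], q[1])):
        s_po = sorted({q[0] for q in group})          # concurrence class S_po
        if len(s_po) < 2 or len(s_po) > MAX_CLASS_SIZE:
            continue
        ic = 1.0 / (len(s_po) * aaic[p])
        for s_i, s_j in combinations(s_po, 2):        # s_i < s_j lexicographically
            yield (s_i, s_j, ic, p, o, '+')
```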

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash-function; the slaves simultaneously gather incoming sorted segments, and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Table 12
Breakdown of timing of distributed concurrence analysis (min / % of total).

| Category | 1 | 2 | 4 | 8 | Full-8 |
|---|---|---|---|---|---|
| Master: Total execution time | 516.2 / 100.0 | 262.7 / 100.0 | 137.8 / 100.0 | 73.0 / 100.0 | 835.4 / 100.0 |
| Executing | 2.2 / 0.4 | 2.3 / 0.9 | 3.1 / 2.3 | 3.4 / 4.6 | 2.1 / 0.3 |
| Miscellaneous | 2.2 / 0.4 | 2.3 / 0.9 | 3.1 / 2.3 | 3.4 / 4.6 | 2.1 / 0.3 |
| Idle (waiting for slaves) | 514.0 / 99.6 | 260.4 / 99.1 | 134.6 / 97.7 | 69.6 / 95.4 | 833.3 / 99.7 |
| Slave: Avg. executing (total excl. idle) | 514.0 / 99.6 | 257.8 / 98.1 | 131.1 / 95.1 | 66.4 / 91.0 | 786.6 / 94.2 |
| Split/sort/scatter (subject) | 119.6 / 23.2 | 59.3 / 22.6 | 29.9 / 21.7 | 14.9 / 20.4 | 175.9 / 21.1 |
| Merge-sort (subject) | 42.1 / 8.2 | 20.7 / 7.9 | 11.2 / 8.1 | 5.7 / 7.9 | 62.4 / 7.5 |
| Split/sort/scatter (object) | 115.4 / 22.4 | 56.5 / 21.5 | 28.8 / 20.9 | 14.3 / 19.6 | 164.0 / 19.6 |
| Merge-sort (object) | 37.2 / 7.2 | 18.7 / 7.1 | 9.6 / 7.0 | 5.1 / 7.0 | 55.6 / 6.6 |
| Extract high-level statistics | 20.8 / 4.0 | 10.1 / 3.8 | 5.1 / 3.7 | 2.7 / 3.7 | 28.4 / 3.3 |
| Generate raw concurrence tuples | 45.4 / 8.8 | 23.3 / 8.9 | 11.6 / 8.4 | 5.9 / 8.0 | 67.7 / 8.1 |
| Coordinate/sort concurrence tuples | 97.6 / 18.9 | 50.1 / 19.1 | 25.0 / 18.1 | 12.5 / 17.2 | 166.0 / 19.9 |
| Merge-sort/aggregate similarity | 36.0 / 7.0 | 19.2 / 7.3 | 10.0 / 7.2 | 5.3 / 7.2 | 66.6 / 8.0 |
| Avg. idle | 2.2 / 0.4 | 4.9 / 1.9 | 6.7 / 4.9 | 6.5 / 9.0 | 48.8 / 5.8 |
| Waiting for peers | 0.0 / 0.0 | 2.6 / 1.0 | 3.6 / 2.6 | 3.2 / 4.3 | 46.7 / 5.6 |
| Waiting for master | 2.2 / 0.4 | 2.3 / 0.9 | 3.1 / 2.3 | 3.4 / 4.6 | 2.1 / 0.3 |



Here we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values would lead to an uneven distribution of data and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970: even for classes with two members, the potential effect on concurrence is negligible.
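A minimal sketch (Python; ours, with hypothetical names) of the split function used in the coordinate steps, hashing on subject or object and leaving rdf:type memberships out of the object-hashing:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def slave_for(quad, position, num_slaves):
    """Route a quad (s, p, o, c) to a slave by modulo-hashing the join position.
    Returns None for rdf:type triples when hashing on object, so that huge
    class-membership classes (e.g. foaf:Person) do not skew the distribution."""
    s, p, o, _ = quad
    if position == "object":
        if p == RDF_TYPE:
            return None                    # not re-distributed by object
        return hash(o) % num_slaves
    return hash(s) % num_slaves
```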

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus: although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal, and a total of only 2.1 min is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples: we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static.

[Fig. 9. Breakdown of potential concurrence classes given with respect to (a) predicate–object pairs and (b) predicate–subject pairs respectively, where for each class size on the x-axis (s_i), we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i·(s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38), chosen to keep the total number of tuples at around one billion (log/log).]

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

| # | Predicate | Objects | Triples | AAC |
|---|---|---|---|---|
| 1 | foaf:nick | 150,433,035 | 150,437,864 | 1.000 |
| 2 | lldpubmed:journal | 6,790,426 | 6,795,285 | 1.003 |
| 3 | rdf:value | 2,802,998 | 2,803,021 | 1.005 |
| 4 | eurostat:geo | 2,642,034 | 2,642,034 | 1.005 |
| 5 | eurostat:time | 2,641,532 | 2,641,532 | 1.005 |

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

| # | Predicate | Subjects | Triples | AAIC |
|---|---|---|---|---|
| 1 | lldpubmed:meshHeading | 2,121,384 | 2,121,384 | 1.009 |
| 2 | opiumfield:recommendation | 1,920,992 | 1,920,992 | 1.010 |
| 3 | fb:type.object.key | 1,108,187 | 1,108,187 | 1.017 |
| 4 | foaf:page | 1,702,704 | 1,712,970 | 1.017 |
| 5 | skipinions:hasFeature | 1,010,604 | 1,010,604 | 1.019 |


That is, larger class sizes give quadratically more concurrences, but occur polynomially less often, until the point where the largest classes (which generally only have one occurrence each) are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair contained 79 thousand entities, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000) we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996); the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 classes at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around one billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs and 303 million through (p,s) pairs).

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 26.4. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs: although fewer intra-PLD concurrences are found, they typically have higher concurrence values.32

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

Table 15
Top five concurrent entities and the number of pairs they share.

| # | Entity label 1 | Entity label 2 | Concur. |
|---|---|---|---|
| 1 | New York City | New York State | 791 |
| 2 | London | England | 894 |
| 3 | Tokyo | Japan | 900 |
| 4 | Toronto | Ontario | 418 |
| 5 | Philadelphia | Pennsylvania | 217 |

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

| # | Ranked entity | Con. | "Closest" entity | Val. |
|---|---|---|---|---|
| 1 | Tim Berners-Lee | 908 | Lalana Kagal | 0.83 |
| 2 | Dan Brickley | 2,552 | Libby Miller | 0.94 |
| 3 | update.status.net | 11 | socialnetwork.ro | 0.45 |
| 4 | FOAF-a-matic | 21 | foaf.me | 0.23 |
| 5 | Evan Prodromou | 3,367 | Stav Prodromou | 0.89 |



In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top five ranked entities; for brevity, we again show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.


7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position (as appropriate), and s′ and o′ denote the input identifiers prior to consolidation.35

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length (a functional property), and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
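A small sketch (Python; our illustration, using a simplified literal representation rather than the system's own data structures) of the relaxed comparison just described: two literal bindings for a functional datatype-property are only flagged as incompatible if both their typed values and their lower-cased lexical forms (ignoring language-tag and datatype) differ:

```python
from dataclasses import dataclass

@dataclass
class Literal:
    lexical: str            # e.g. "25-05"
    datatype: str = None    # e.g. "xsd:gMonthDay"
    lang: str = None        # e.g. "en"

def incompatible_values(l1: Literal, l2: Literal) -> bool:
    """Flag an inconsistency for a functional property iff (i) the typed values
    differ (approximated here by lexical+datatype) and (ii) the lower-cased
    lexical forms, minus lang-tag/datatype, also differ."""
    same_typed_value = (l1.lexical, l1.datatype) == (l2.lexical, l2.datatype)
    same_relaxed = l1.lexical.lower() == l2.lexical.lower()
    return not same_typed_value and not same_relaxed

print(incompatible_values(Literal("male"), Literal("Male", lang="en")))   # False
print(incompatible_values(Literal("male"), Literal("female")))            # True
```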

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

38 In any case, we always source terminological data from the raw unconsolidated corpus.

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a foaf:Person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

42 Note that this could also be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
made owl:propertyDisjointWith maker .

Assertional [fictional]:
ex:AZ″ maker ex:entcons′ .
dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp), and (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation, by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject (loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics); c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C_1, ..., C_n}, C ⊆ 2^(U∪B), such that each C_i ∈ C, C_i ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq_0, each containing an individual identifier in the equivalence class. Now, let Ω(E_i, E_j) denote the aggregated weight of the edge considering the merge of the nodes of E_i and the nodes of E_j in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in E_i and all nodes in E_j, and such that c denotes the concurrence score considering the merge of the entities in E_i and E_j; intuitively, the same weight as before, but applied as if the identifiers in E_i and E_j were consolidated in the graph. We can apply the following clustering:

– for each pair of sets E_i, E_j ∈ Eq_n such that there exists no {a,b} ∈ C with a ∈ E_i, b ∈ E_j (i.e., consistently mergeable subsets), identify the weights Ω(E_i, E_j) and order the pairings;

– in descending order with respect to the above weights, merge E_i, E_j pairs (such that neither E_i nor E_j have already been merged in this iteration), producing Eq_{n+1} at the iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting-set techniques introduced by Reiter [55].
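To make the procedure concrete, a simplified sketch (Python; our own illustration, using plain numeric weights in place of the lexicographically ordered (d, c) pairs) of the greedy consistent clustering over one equivalence class:

```python
def repair_equivalence_class(identifiers, incompatible, weight):
    """Greedily rebuild consistent partitions of an equivalence class.

    identifiers:  iterable of coreferent identifiers
    incompatible: set of frozensets {a, b} that must end up in different partitions
    weight:       function(partition_i, partition_j) -> aggregated edge weight
                  (a single number standing in for the (d, c) pairs)
    """
    partitions = [frozenset([i]) for i in identifiers]

    def mergeable(p, q):
        return not any(frozenset([a, b]) in incompatible for a in p for b in q)

    while True:
        # candidate merges between consistently mergeable partitions, best first
        candidates = sorted(
            ((weight(p, q), p, q)
             for i, p in enumerate(partitions)
             for q in partitions[i + 1:]
             if mergeable(p, q) and weight(p, q) > 0),
            key=lambda t: t[0], reverse=True)
        merged_this_round, used = [], set()
        for _, p, q in candidates:
            if p in used or q in used:
                continue                     # each partition merged at most once per round
            merged_this_round.append(p | q)
            used.update([p, q])
        if not merged_this_round:
            return partitions                # fixpoint: no consistent merge left
        partitions = merged_this_round + [p for p in partitions if p not in used]
```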

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s,p,o,c,s′,o′) and object (o,p,s,c,o′,s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.

On a micro level, we buffer each individual consolidated segment into an in-memory index. Currently, these segments fit in memory, where for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s,p,o,c,s′,o′) and (o,p,s,c,o′,s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

Table 17
Breakdown of timing of distributed disambiguation and repair (min / % of total; cells not recoverable from the source are marked –).

| Category | 1 | 2 | 4 | 8 | Full-8 |
|---|---|---|---|---|---|
| Master: Total execution time | 114.7 / 100.0 | 61.3 / 100.0 | 36.7 / 100.0 | 23.8 / 100.0 | 141.1 / 100.0 |
| Executing | – | – | – | – | 0.1 / 0.1 |
| Miscellaneous | – | – | – | – | 0.1 / 0.1 |
| Idle (waiting for slaves) | 114.7 / 100.0 | 61.2 / 100.0 | 36.6 / 99.9 | 23.8 / 99.9 | 141.1 / 100.0 |
| Slave: Avg. executing (total excl. idle) | 114.7 / 100.0 | 59.0 / 96.3 | 33.6 / 91.5 | 22.3 / 93.7 | 130.9 / 92.8 |
| Identify inconsistencies and repairs | 74.7 / 65.1 | 38.5 / 62.8 | 23.2 / 63.4 | 17.2 / 72.2 | 72.4 / 51.3 |
| Repair corpus | 40.0 / 34.8 | 20.5 / 33.5 | 10.3 / 28.1 | 5.1 / 21.5 | 58.5 / 41.5 |
| Avg. idle | – | 2.3 / 3.7 | 3.1 / 8.5 | 1.5 / 6.3 | 10.2 / 7.2 |
| Waiting for peers | 0.0 / 0.0 | 2.2 / 3.7 | 3.1 / 8.4 | 1.5 / 6.2 | 10.1 / 7.2 |
| Waiting for master | – | – | – | – | 0.1 / 0.1 |


7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of around one minute is spent on the slave machines for this step. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage (and the associated load-balancing issues) would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection.

| Manual | Auto | Count | % |
|---|---|---|---|
| SAME | / | 223 | 44.3 |
| DIFFERENT | / | 261 | 51.9 |
| UNCLEAR | – | 19 | 3.8 |
| / | SAME | 361 | 74.6 |
| / | DIFFERENT | 123 | 25.4 |
| SAME | SAME | 205 | 91.9 |
| SAME | DIFFERENT | 18 | 8.1 |
| DIFFERENT | DIFFERENT | 105 | 40.2 |
| DIFFERENT | SAME | 156 | 59.7 |

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

| # | Functional property | Detections |
|---|---|---|
| 1 | foaf:gender | 60 |
| 2 | foaf:age | 32 |
| 3 | dbo:height, dbo:length | 4 |
| 4 | dbo:wheelbase, dbo:width | 3 |
| 5 | atomowl:body | 1 |
| 6 | loc:address, loc:name, loc:phone | 1 |

Table 19
Breakdown of inconsistency detections for disjoint classes.

| # | Disjoint class 1 | Disjoint class 2 | Detections |
|---|---|---|---|
| 1 | foaf:Document | foaf:Person | 163 |
| 2 | foaf:Document | foaf:Agent | 153 |
| 3 | foaf:Organization | foaf:Person | 17 |
| 4 | foaf:Person | foaf:Project | 3 |
| 5 | foaf:Organization | foaf:Project | 1 |

[Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).]

[Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).]



From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given non-distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers, and for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

44 We were particularly careful to distinguish information resources (such as Wikipedia articles) and non-information resources as being DIFFERENT.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different but were re-consolidated by the repair were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable (a minimal sketch of the core sort routine is given after the following list). Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].
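To make the role of external sorting concrete, the following is a minimal, self-contained sketch of the classic two-phase external merge-sort over the lines of a file (e.g., N-Quads): chunks that fit in memory are sorted and spilled to disk, and the sorted runs are then lazily merged. It illustrates the generic technique only; the chunk size, temporary-file handling and plain lexicographic ordering are illustrative assumptions, not the implementation used in our experiments.

```python
import heapq
import itertools
import tempfile

def external_sort(input_path, output_path, chunk_size=1_000_000):
    """Two-phase external merge-sort over the lines (e.g., N-Quads) of a large file."""
    runs = []
    # Phase 1: sort chunks that fit in main memory and spill each as a sorted "run".
    with open(input_path) as src:
        while True:
            chunk = list(itertools.islice(src, chunk_size))
            if not chunk:
                break
            chunk.sort()
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(chunk)
            run.seek(0)
            runs.append(run)
    # Phase 2: lazily k-way merge the sorted runs; overall cost is O(n log n).
    with open(output_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for run in runs:
        run.close()
```

Given quads sorted on the relevant position (e.g., subject), the tasks above then reduce to linear scans and merge-joins over the sorted files.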

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to do so would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples (a minimal sketch of such a placement function follows this list); although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.
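As a rough illustration of the placement scheme mentioned above (not the exact code of our pipeline), the following sketch assigns a quad to the machine given by a modulo-hash of its subject and, except for rdf:type triples, also to the machine given by a modulo-hash of its object; the CRC32 hash and the quad representation are arbitrary choices.

```python
import zlib

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def placements(quad, num_machines):
    """Machines that should receive a (subject, predicate, object, context) quad:
    one chosen by a modulo-hash of the subject and, unless the predicate is
    rdf:type (whose class-valued objects would skew the distribution), one
    chosen by a modulo-hash of the object."""
    s, p, o, _ = quad
    machines = {zlib.crc32(s.encode()) % num_machines}
    if p != RDF_TYPE:
        machines.add(zlib.crc32(o.encode()) % num_machines)
    return machines
```

For example, a quad whose predicate is rdf:type is hashed only on its subject, whereas any other quad is sent to (at most) two machines, one per hashed position.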

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty; [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching; for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets)

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
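To make the canonicalisation optimisation concrete, here is a minimal in-memory union-find sketch (cf. Tarjan [66]) that elects one canonical identifier per owl:sameAs equivalence class instead of materialising all pairwise relations; electing the lexicographically least identifier is an illustrative convention, and this is not the on-disk, batch-oriented procedure applied in our pipeline.

```python
class UnionFind:
    """Disjoint-set forest with path halving (cf. Tarjan [66])."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # keep the lexicographically least identifier as the class root
            self.parent[max(rx, ry)] = min(rx, ry)

def canonicalise(same_as_pairs):
    """Map each identifier to one canonical identifier per equivalence class."""
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    return {x: uf.find(x) for x in uf.parent}
```

For example, given the pairs (ex:b, ex:a) and (ex:c, ex:b), all three identifiers map to ex:a, so data mentioning ex:b or ex:c can be rewritten to ex:a instead of emitting the six non-reflexive pairwise owl:sameAs statements.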

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris


to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
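For readers unfamiliar with the technique, the following is a minimal MinHash-based LSH sketch that groups resources whose (non-empty) token sets are likely to have high Jaccard similarity; it illustrates the general idea only and is not the RDFSim implementation, and the number of hash functions, the banding parameters and the MD5-based hashing are arbitrary choices.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature of a non-empty token set: for each seeded hash function,
    keep the smallest token hash; two sets agree on a given position with
    probability equal to their Jaccard similarity."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def lsh_buckets(docs, num_hashes=64, bands=16):
    """Band the signatures: documents sharing any complete band land in the same
    bucket, so only documents within a bucket need a full (e.g., Jaccard) comparison."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]
```

Candidates that end up sharing a bucket can then be compared exactly, e.g., with the Jaccard coefficient mentioned above.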

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "dead link") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We


have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.


Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix  URI

"T-Box prefixes"
atomowl  http://bblfish.net/work/atom-owl/2006-06-06/
b2r  http://bio2rdf.org/bio2rdf
b2rr  http://bio2rdf.org/bio2rdf_resource
dbo  http://dbpedia.org/ontology/
dbp  http://dbpedia.org/property/
ecs  http://rdf.ecs.soton.ac.uk/ontology/ecs
eurostat  http://ontologycentral.com/2009/01/eurostat/ns
fb  http://rdf.freebase.com/ns/
foaf  http://xmlns.com/foaf/0.1/
geonames  http://www.geonames.org/ontology
kwa  http://knowledgeweb.semanticweb.org/heterogeneity/alignment
lldpubmed  http://linkedlifedata.com/resource/pubmed/
loc  http://sw.deri.org/2006/07/location/loc
mo  http://purl.org/ontology/mo/
mvcb  http://webns.net/mvcb
opiumfield  http://rdf.opiumfield.com/lastfm/spec
owl  http://www.w3.org/2002/07/owl#
quaffing  http://purl.org/net/schemas/quaffing/
rdf  http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs  http://www.w3.org/2000/01/rdf-schema#
skipinions  http://skipforward.net/skipforward/page/seeder/skipinions/
skos  http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl  http://www.advogato.org/person/timbl/foaf.rdf
dbpedia  http://dbpedia.org/resource/
dblpperson  http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p  http://www.eswc2006.org/people/
identicau  http://identi.ca/user/
kingdoms  http://lod.geospecies.org/kingdoms/
macs  http://stitch.cs.vu.nl/alignments/macs/
semweborg  http://data.semanticweb.org/organization/
swid  http://semanticweb.org/id/
timblfoaf  http://www.w3.org/People/Berners-Lee/card#
vperson  http://virtuoso.openlinksw.com/dataspace/person/
wikier  http://www.wikier.org/foaf.rdf
yagor  http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in:


Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).
[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondřej Šváb-Zamazal, Vojtěch Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in linked data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

Page 5: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

80 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

apply over assertional (ie any) knowledge Such rules enable anoptimisation whereby terminological patterns are pre-bound togenerate a new larger set of purely assertional rules called T-ground rules We illustrate this with an example

Example Let Rprp-ifp denote the following rule

(p a owl InverseFunctionalProperty)(x1 p y)(x2 p y))(x1 owl sameAs x2)

When writing T-split rules we denote terminological patternsin the body by underlining Also we use lsquoarsquo as a convenient short-cut for rdftype which indicates class-membership Now take theterminological triples

(isbn a owlInverseFunctionalProperty)(mbox a owlInverseFunctionalProperty)

From these two triples we can generate two T-ground rules asfollows

(x1 isbn y)(x2 isbn y))(x1 owl sameAs x2)

(x1 mbox y)(x2 mbox y))(x1 owl sameAs x2)

These T-ground rules encode the terminological OWL knowledgegiven by the previous two triples and can be applied directly overthe assertional data eg describing specific books with ISBNnumbers

25 Authoritative T-split rules

Caution must be exercised when applying reasoning over arbi-trary data collected from the Web eg Linked Data In previousworks we have encountered various problems when naiumlvely per-forming rule-based reasoning over arbitrary Linked Data [3133]In particular third-party claims made about popular classes andproperties must be critically analysed to ensure their trustworthi-ness As a single example of the type of claims that can cause issueswith reasoning we found one document that defines nine propertiesas the domain of rdftype6 thereafter the semantics ofrdfsdomainmandate that everything which is a member of any class(ie almost every known resource) can be inferred to be a lsquolsquomemberrsquorsquo ofeach of the nine properties We found that naiumlve reasoning lead to200 more inferences than would be expected when consideringthe definitions of classes and properties as defined in their lsquolsquoname-space documentsrsquorsquo Thus in the general case performing rule-basedmaterialisation with respect to OWL semantics over arbitrary LinkedData requires some critical analysis of the source of data as per ourauthoritative reasoning algorithm Please see [9] for a detailed analysisof the explosion in inferences that occurs when authoritative analysisis not applied

Such observations prompted us to investigate more robustforms of reasoning Along these lines when reasoning over LinkedData we apply an algorithm called authoritative reasoning[143133] which critically examines the source of terminologicaltriples and conservatively rejects those that cannot be definitivelytrusted ie those that redefine classes andor properties outside ofthe namespace document

6 viz httpwwweiaonetrdf10

Dereferencing and authority We first give the functionhttpU 2G that maps a URI to an RDF graph it returns upon aHTTP lookup that returns the response code 200 Okay in the caseof failure the function maps to an empty RDF graph Next we de-fine the function redirU U that follows a (finite) sequence ofHTTP redirects given upon lookup of a URI or that returns theURI itself in the absence of a redirect or in the case of failure thisfunction first strips the fragment identifier (given after the lsquorsquo char-acter) from the original URI if present Then we give the dereferenc-ing function derefhttpredir as the mapping from a URI (a Weblocation) to an RDF graph it may provide by means of a given HTTPlookup that follows redirects Finally we can define the authorityfunctionmdashthat maps the URI of a Web document to the set of RDFterms it is authoritative formdashas follows

auth U 2C

u fc 2 U j redirethcTHORN frac14 ug [ BethhttpethsTHORNTHORNwhere B(G) denotes the set of blank-nodes appearing in a graph GIn other words a Web document is authoritative for URIs that dere-ference to it and the blank nodes it contains

We note that making RDF vocabulary URIs dereferenceable isencouraged in various best-practices [454] To enforce authorita-tive reasoning when applying rules with terminological and asser-tional patterns in the body we require that terminological triplesare served by a document authoritative for terms that are boundby specific variable positions in the rule

Authoritative T-split rule application Given a T-split rule and arule application involving a mapping l as before for the rule appli-cation to be authoritative there must additionally exist a l(v) suchthat v 2 VethAnteT THORN VethAnteGTHORN l(v) 2 auth(u) lethAnteT THORN httpethuTHORNWe call the set of variables that appear in both the terminologicaland assertional segment of the rule body (ie VethAnteT THORN VethAnteGTHORN) the set of authoritative variables for the rule Where a ruledoes not have terminological or assertional patterns we considerany rule-application as authoritative The notation of an authorita-tive T-ground rule follows likewise

Example Let Rprp-ifp denote the same rule as used in the previousexample and let the following two triples be given by the derefer-enced document of v1isbn but not v2mboxmdashie a documentauthoritative for the former term but not the latter

(v1isbn a owlInverseFunctionalProperty)(v2mbox a owlInverseFunctionalProperty)

The only authoritative variable appearing in both the termino-logical and assertional patterns of the body of Rprp-ifp is p boundby v1isbn and v2mbox above Since the document serving thetwo triples is authoritative for v1isbn the first triple is consid-ered authoritative and the following T-ground rule will begenerated

(x1 v1isbn y)(x2 v1isbn y))(x1 owlsameAs x2)

However since the document is not authoritative for v2mboxthe second T-ground rule will be considered non-authoritative anddiscarded

(x1 v2mbox y)(x2 v2mbox y))(x1 owlsameAs x2)

where third-party documents re-define remote terms in a way thataffects inferencing over instance data we filter such non-authoritative definitions

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 81

26 OWL 2 RLRDF rules

Inference rules can be used to (partially) support the semanticsof RDFS and OWL Along these lines the OWL 2 RLRDF [24] rulesetis a partial-axiomatisation of the OWL 2 RDF-Based Semanticswhich is applicable for arbitrary RDF graphs and constitutes anextension of the RDF Semantics [28] in other words the rulesetsupports a standardised profile of reasoning that partially coversthe complete OWL semantics Interestingly for our scenario thisprofile includes inference rules that support the semantics andinference of owlsameAs as discussed

First in Table 1 we provide the set of OWL 2 RLRDF rules thatsupport the (positive) semantics of owl sameAs axiomatising thesymmetry (rule eq-sym) and transitivity (rule eq-trans) of therelation To take an example rule eq-sym is as follows

(x owlsameAs y))(y owlsameAs x)

where if we know that (a owlsameAs b) the rule will inferthe symmetric relation (b owlsameAs a) Rule eq-trans oper-ates analogously both together allow for computing the transitivesymmetric closure of the equivalence relation Furthermore Table 1also contains the OWL 2 RLRDF rules that support the semantics ofreplacement (rules eq-rep-) whereby data that holds for one en-tity must also hold for equivalent entities

Note that we (optionally and in the case of later evaluation)choose not to support

(i) the reflexive semantics of owlsameAs since reflexiveowlsameAs statements will not lead to any consolidationor other non-reflexive equality relations and will produce alarge bulk of materialisations

(ii) equality on predicates or values for rdftype where we donot want possibly imprecise owlsameAs data to affectterms in these positions

Given that the semantics of equality is quadratic with respect tothe A-Box we apply a partial-materialisation approach that givesour notion of consolidation instead of materialising (ie explicitlywriting down) all possible inferences given by the semantics ofreplacement we instead choose one canonical identifier to repre-sent the set of equivalent terms thus effectively compressing thedata We have used this approach in previous works [30ndash32] andit has also appeared in related works in the literature [706] as acommon-sense optimisation for handling data-level equality Totake an example in later evaluation (cf Table 10) we find a validequivalence class (set of equivalent entities) with 33052 mem-bers materialising all pairwise non-reflexive owlsameAs state-ments would infer more than 1 billion owlsameAs relations(330522 33052) = 1092401652 further assuming that eachentity appeared in on average eg two quadruples we would inferan additional 2 billion of massively duplicated data By choosinga single canonical identifier we would instead only materialise100 thousand statements

Note that although we only perform partial materialisation wedo not change the semantics of equality our methods are sound

Table 1Rules that support the positive semantics of owl sameAs

OWL2RL Antecedent ConsequentAssertional

eq-sym x owlsameAs y y owlsameAs x eq-trans x owlsameAs y y owlsameAs z x owlsameAs z eq-rep-s s owlsameAs s0 s p o s0 p o eq-rep-o o owlsameAs o0 s p o s p o0

with respect to OWL semantics In addition alongside the partiallymaterialised data we provide a set of consolidated owl sameAs

relations (containing all of the identifiers in each equivalence class)that can be used to lsquolsquobackward-chainrsquorsquo the full inferences possiblethrough replacement (as required) Thus we do not consider thecanonical identifier as somehow lsquodefinitiversquo or superceding theother identifiers but merely consider it as representing the equiva-lence class7

Finally herein we do not consider consolidation of literalsone may consider useful applications eg for canonicalisingdatatype literals but such discussion is out of the currentscope

As previously mentioned OWL 2 RLRDF also contains rules thatuse terminological knowledge (alongside assertional knowledge)to directly infer owl sameAs relations We enumerate these rulesin Table 2 note that we italicise the labels of rules supporting fea-tures new to OWL 2 and that we list authoritative variables inbold8

Applying only these OWL 2 RLRDF rules may miss inference ofsome owlsameAs statements For example consider the exampleRDF data given in Turtle syntax [3] as follows

# From the FOAF Vocabulary:
foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf .
foaf:isPrimaryTopicOf owl:inverseOf foaf:primaryTopic .
foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

# From Example Document A:
exA:axel foaf:homepage <http://polleres.net> .

# From Example Document B:
<http://polleres.net> foaf:primaryTopic exB:apolleres .

# Inferred through prp-spo1:
exA:axel foaf:isPrimaryTopicOf <http://polleres.net> .

# Inferred through prp-inv:
exB:apolleres foaf:isPrimaryTopicOf <http://polleres.net> .

# Subsequently inferred through prp-ifp:
exA:axel owl:sameAs exB:apolleres .
exB:apolleres owl:sameAs exA:axel .

The example uses properties from the prominent FOAF vocabulary, which is used for publishing personal profiles as RDF. Here, we additionally need OWL 2 RL/RDF rules prp-inv and prp-spo1—handling standard owl:inverseOf and rdfs:subPropertyOf inferencing, respectively—to infer the owl:sameAs relation entailed by the data.

Along these lines, we support an extended set of OWL 2 RL/RDF rules that contain precisely one assertional pattern, and for which we have demonstrated a scalable implementation, called SAOR, designed for Linked Data reasoning [33]. These rules are listed in Table 3 and, as per the previous example, can generate additional inferences that indirectly lead to the derivation of new owl:sameAs data. We will use these rules for entity consolidation in Section 5.

Lastly, as we discuss later (particularly in Section 7), Linked Data is inherently noisy and prone to mistakes. Although our methods

7 We may optionally consider non-canonical blank-node identifiers as redundant and discard them.

8 As discussed later, our current implementation requires at least one variable to appear in all assertional patterns in the body, and so we do not support prp-key and cls-maxqc3; however, these two features are not commonly used in Linked Data vocabularies [29].


Table 2. OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs.

OWL2RL       Antecedent                                                        Assertional                      Consequent
             (terminological)
prp-fp       p a owl:FunctionalProperty                                        x p y1 . x p y2                  y1 owl:sameAs y2
prp-ifp      p a owl:InverseFunctionalProperty                                 x1 p y . x2 p y                  x1 owl:sameAs x2
cls-maxc2    x owl:maxCardinality 1 ; owl:onProperty p                         u a x . u p y1 . u p y2          y1 owl:sameAs y2
cls-maxqc4   x owl:maxQualifiedCardinality 1 ; owl:onProperty p ;
               owl:onClass owl:Thing                                           u a x . u p y1 . u p y2          y1 owl:sameAs y2

Table 3. OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]).

OWL2RL      Antecedent (terminological)                     Assertional          Consequent
prp-dom     p rdfs:domain c                                 x p y                x a c
prp-rng     p rdfs:range c                                  x p y                y a c
prp-symp    p a owl:SymmetricProperty                       x p y                y p x
prp-spo1    p1 rdfs:subPropertyOf p2                        x p1 y               x p2 y
prp-eqp1    p1 owl:equivalentProperty p2                    x p1 y               x p2 y
prp-eqp2    p1 owl:equivalentProperty p2                    x p2 y               x p1 y
prp-inv1    p1 owl:inverseOf p2                             x p1 y               y p2 x
prp-inv2    p1 owl:inverseOf p2                             x p2 y               y p1 x
cls-int2    c owl:intersectionOf (c1 ... cn)                x a c                x a c1 ... cn
cls-uni     c owl:unionOf (c1 ... ci ... cn)                x a ci               x a c
cls-svf2    x owl:someValuesFrom owl:Thing ;
              owl:onProperty p                              u p v                u a x
cls-hv1     x owl:hasValue y ; owl:onProperty p             u a x                u p y
cls-hv2     x owl:hasValue y ; owl:onProperty p             u p y                u a x
cax-sco     c1 rdfs:subClassOf c2                           x a c1               x a c2
cax-eqc1    c1 owl:equivalentClass c2                       x a c1               x a c2
cax-eqc2    c1 owl:equivalentClass c2                       x a c2               x a c1

Table 4. OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold.

OWL2RL       Antecedent (terminological)                         Assertional
eq-diff1     –                                                   x owl:sameAs y . x owl:differentFrom y
prp-irp      p a owl:IrreflexiveProperty                         x p x
prp-asyp     p a owl:AsymmetricProperty                          x p y . y p x
prp-pdw      p1 owl:propertyDisjointWith p2                      x p1 y . x p2 y
prp-adp      x a owl:AllDisjointProperties ;
               owl:members (p1 ... pn)                           u pi y . u pj y (i ≠ j)
cls-com      c1 owl:complementOf c2                              x a c1 . x a c2
cls-maxc1    x owl:maxCardinality 0 ; owl:onProperty p           u a x . u p y
cls-maxqc2   x owl:maxQualifiedCardinality 0 ; owl:onProperty p ;
               owl:onClass owl:Thing                             u a x . u p y
cax-dw       c1 owl:disjointWith c2                              x a c1 . x a c2
cax-adc      x a owl:AllDisjointClasses ;
               owl:members (c1 ... cn)                           z a ci . z a cj (i ≠ j)


are sound (i.e., correct) with respect to formal OWL semantics, such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair. Relatedly, OWL contains features that allow for detecting formal contradictions—called inconsistencies—in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization: formally, the intersection of such disjoint classes should be empty and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later in Section 7 to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations (a minimal interface sketch follows the list):

– Scatter: partition on-disk data using some local split function, and send each chunk to individual slave machines for subsequent processing.

– Run: request the parallel execution of a task by the slave machines—such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis.

– Gather: gather chunks of output data from the slave swarm and perform some local merge function over the data.

– Flood: broadcast global knowledge required by all slave machines for a future task.
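For illustration only, the following Java sketch shows how such master-side operations might be exposed as an interface; the type and method names (Master, Task, SplitFunction, MergeFunction) are hypothetical placeholders and not the actual SWSE API.

import java.io.File;
import java.util.List;

/** Hypothetical signatures for the master-side distributed operations. */
interface SplitFunction { int partition(String tupleLine, int numSlaves); }
interface MergeFunction { void merge(List<File> chunks, File out); }
interface Task { void run(); }          // executed remotely on each slave

interface Master {
  /** Scatter: split local on-disk data and send one chunk to each slave. */
  void scatter(File data, SplitFunction split);

  /** Run: request that all slaves execute the given task in parallel. */
  void run(Task task);

  /** Gather: collect output chunks from the slaves and merge them locally. */
  File gather(String remoteOutputName, MergeFunction merge);

  /** Flood: broadcast globally required knowledge (e.g., rules) to all slaves. */
  void flood(File globalKnowledge);
}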


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine is partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all intermediary data through one machine.

Note that herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications: that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle, waiting for the slaves, and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel: although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which it should be performed, and the task-scheduling possible. The distributed implementations of our tasks are designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid—insofar as reasonable—duplicating computation.

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

9 http://openhms.sourceforge.net/rmiio/

10 We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications: viz., 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka. PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count per each PLD queue were polled first.

With regards the resulting corpus, of the 1.118 billion quads, 1.106 billion are unique and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now, we discuss the usage of terms in a data-level position, viz., terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred on average 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position—19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs—56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of


non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident—again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position).

Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position).

– each URI is re-used on average in 4.7 documents, but usually only within the same domain—most external re-use is in the object position of a triple;

– 99% of URIs appear in only one PLD.

We can conclude that, within our corpus—itself a general crawl for RDF/XML on the Web—we find that there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising of 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation that consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource" [7]

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus—provided directly by the Linked Data publishers themselves—that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps identifiers to the equivalence class in which each is contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier for each equivalence class; we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one


thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;

(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;

(iii) for each equivalence class in the index, choose a canonical term;

(iv) scan the corpus again, canonicalising any term in the subject position or object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus—owl:sameAs statements—and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;

(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);

(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing a term with the lowest alphabetical ordering; a canonical term is thus associated to each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
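A minimal Java sketch of this canonical-term choice, assuming terms are plain strings and that blank-nodes are recognisable by a "_:" prefix (an assumption made for illustration; the term representation in the actual code-base may differ):

import java.util.Collections;
import java.util.Comparator;
import java.util.Set;

public class CanonicalTermChooser {

  // Order terms so that URIs come before blank-nodes, and otherwise by
  // lexical order; the minimum under this ordering is the canonical term.
  static final Comparator<String> CANONICAL_ORDER =
      Comparator.<String, Boolean>comparing(t -> t.startsWith("_:"))
                .thenComparing(Comparator.<String>naturalOrder());

  public static String choose(Set<String> equivalenceClass) {
    return Collections.min(equivalenceClass, CANONICAL_ORDER);
  }
}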

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations.

(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index.

(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods—involving the two scans of the main corpus—can be conducted in parallel.

Algorithm 1. Building equivalence map [32]

Require: SAMEAS DATA: SA
1:  map ← {}
2:  for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
3:    eqc_s ← map.get(s)
4:    eqc_o ← map.get(o)
5:    if eqc_s = ∅ ∧ eqc_o = ∅ then
6:      eqc_{s∪o} ← {s, o}
7:      map.put(s, eqc_{s∪o})
8:      map.put(o, eqc_{s∪o})
9:    else if eqc_s = ∅ then
10:     add s to eqc_o
11:     map.put(s, eqc_o)
12:   else if eqc_o = ∅ then
13:     add o to eqc_s
14:     map.put(o, eqc_s)
15:   else if eqc_s ≠ eqc_o then
16:     if |eqc_s| > |eqc_o| then
17:       add all eqc_o into eqc_s
18:       for e_o ∈ eqc_o do
19:         map.put(e_o, eqc_s)
20:       end for
21:     else
22:       add all eqc_s into eqc_o
23:       for e_s ∈ eqc_s do
24:         map.put(e_s, eqc_o)
25:       end for
26:     end if
27:   end if
28: end for

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan extracting owl:sameAs statements took 12.5 min, with an average idle time for the servers of 11 s (1.4%)—i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan rewriting the data according to the canonical identifiers took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (Full-8), we give a breakdown of the timing for the tasks over the full corpus using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1/2/4/8) over the

Table 5. Breakdown of timing of distributed baseline consolidation.

Category                        1              2              4              8              Full-8
                                min    %       min    %       min    %       min    %       min    %
Master
  Total execution time          42.2   100.0   22.3   100.0   11.9   100.0   6.7    100.0   63.3   100.0
  Executing                     0.5    1.1     0.5    2.2     0.5    4.0     0.5    7.5     8.5    13.4
    Aggregating owl:sameAs      0.4    1.1     0.5    2.2     0.5    4.0     0.5    7.5     8.4    13.3
    Miscellaneous               0.1    0.1     0.1    0.4     ~      ~       ~      ~       0.1    0.1
  Idle (waiting for slaves)     41.8   98.9    21.8   97.7    11.4   95.8    6.2    92.1    54.8   86.6
Slaves
  Avg. executing (total)        41.8   98.9    21.8   97.7    11.4   95.8    6.2    92.1    53.5   84.6
    Extract owl:sameAs          9.0    21.4    4.5    20.2    2.2    18.4    1.1    16.8    12.3   19.5
    Consolidate                 32.7   77.5    17.1   76.9    8.8    73.5    4.7    70.5    41.2   65.1
  Avg. idle                     0.5    1.1     0.7    2.9     1.0    8.2     0.8    12.7    9.8    15.4
    Waiting for peers           0.0    0.0     0.1    0.7     0.5    4.0     0.3    4.7     1.3    2.0
    Waiting for master          0.5    1.1     0.5    2.3     0.5    4.2     0.5    7.9     8.5    13.4

Fig. 3. Distribution of sizes of equivalence classes, on log/log scale (from [32]).

Fig. 4. Distribution of the number of PLDs per equivalence class, on log/log scale.

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


100 million quadruple corpus. Note that in the table we use the symbol ~ to indicate a negligible value (0 < ~ < 0.05). We see that, as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines are doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5 would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs)—an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp., Dieter Fensel and Rudi Studer—authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes—which, we have seen, are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure—we also manually evaluated a random sample of results. For the sampling, we take the closed


Table 6. Largest five equivalence classes (from [32]).

   Canonical term (lexically first in equivalence class)                             Size    Correct?
1  http://bio2rdf.org/dailymed_drugs:1000                                            8,481   –
2  http://bio2rdf.org/dailymed_drugs:1042                                            800     –
3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e       443     ✓
4  http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1             353     ✓
5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9  316     ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision, or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type-object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly from 1.005 to 1.007 (a 0.2% increase).

Table 7. Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

    PLD             PLD                      Co-occur
1   rdfize.com      uriburner.com            506,623
2   dbpedia.org     freebase.com             187,054
3   bio2rdf.org     purl.org                 185,392
4   loc.gov         info:lc/authorities (a)  166,650
5   l3s.de          rkbexplorer.com          125,842
6   bibsonomy.org   l3s.de                   125,832
7   bibsonomy.org   rkbexplorer.com          125,830
8   dbpedia.org     mpii.de                  99,827
9   freebase.com    mpii.de                  89,610
10  dbpedia.org     umbel.org                65,565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities: in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features—including functional properties, inverse-functional properties and certain cardinality restrictions—that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes—such as ISBNs for books, EAN UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc.—to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, Wikipedia articles, etc.

Although Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section we avoid the need to store equivalence information in-memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise, on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc* rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data—hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals—we also support an additional rule that gives an exact cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;

(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;

(iii) apply general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;

(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

In Step (i), we extract terminological data—required for application of our rules—from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf

we would generate a T-ground rule

x homepage y ⇒ x isPrimaryTopicOf y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

x isPrimaryTopicOf y

which may indicate data useful for consolidation (note that here, isPrimaryTopicOf is assumed to be inverse-functional).
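As a rough Java sketch of this T-grounding step—with a hypothetical, simplified triple-pattern and rule representation that is not the actual SAOR data model—each authoritative rdfs:subPropertyOf axiom is compiled into a rule with a single assertional pattern:

import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified triple-pattern/rule representation ("?x" marks variables). */
record TriplePattern(String s, String p, String o) {}
record Rule(TriplePattern body, TriplePattern head) {}

public class TGrounder {
  /**
   * Compile prp-spo1 against the authoritative terminology: for every
   * (p1 rdfs:subPropertyOf p2) axiom, emit the T-ground rule
   * (?x p1 ?y) => (?x p2 ?y), which can then be applied in a single scan.
   */
  public static List<Rule> groundSubPropertyOf(List<TriplePattern> terminology) {
    List<Rule> rules = new ArrayList<>();
    for (TriplePattern t : terminology) {
      if (t.p().equals("rdfs:subPropertyOf")) {
        rules.add(new Rule(new TriplePattern("?x", t.s(), "?y"),
                           new TriplePattern("?x", t.o(), "?y")));
      }
    }
    return rules;
  }
}

For instance, passed the axiom (foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf), this sketch would emit the single-pattern rule used in the example above.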

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern—including any owl:sameAs statements found—are buffered to a file for later join computation.

Subsequently, in Step (iv), we must now compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18

(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and

(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data

18 These files are G-Zip compressed flat files of N-Triple-like syntax, encoding arbitrary length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.


according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.
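A heavily simplified Java sketch of the object-position merge-join for inverse-functional properties follows; it assumes the consolidation-relevant statements are tab-separated (s, p, o) lines pre-sorted by (object, property), and the I/O handling and class name are illustrative rather than the actual on-disk implementation.

import java.io.*;
import java.util.*;

public class IfpMergeJoin {
  /**
   * Input: statements whose predicate is inverse-functional, sorted by (o, p).
   * Output: owl:sameAs statements linking subjects that share a (p, o) pair.
   */
  public static void join(BufferedReader sortedIn, Writer sameAsOut) throws IOException {
    String prevKey = null;                         // current (o, p) group key
    List<String> subjects = new ArrayList<>();
    String line;
    while ((line = sortedIn.readLine()) != null) {
      String[] spo = line.split("\t");
      String key = spo[2] + "\t" + spo[1];         // group on object, then property
      if (!key.equals(prevKey)) {
        emit(subjects, sameAsOut);                 // flush the previous group
        subjects.clear();
        prevKey = key;
      }
      if (!subjects.contains(spo[0])) subjects.add(spo[0]);
    }
    emit(subjects, sameAsOut);                     // flush the final group
  }

  // Link every subject in the group to the first one; the symmetric/transitive
  // closure and canonicalisation are handled by the later owl:sameAs closure.
  private static void emit(List<String> subjects, Writer out) throws IOException {
    for (int i = 1; i < subjects.size(); i++) {
      out.write(subjects.get(0) + "\towl:sameAs\t" + subjects.get(i) + "\n");
    }
  }
}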

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method again, leveraging external sorts and merge-joins to perform the computation (herein we sketch, and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc., to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:

20 One could instead build an on-disk map for equivalence classes and pivot elements, and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);

– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;

– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;

– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;

(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric transitive closure.

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external-sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite, and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20
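The single-position rewrite can likewise be sketched as a merge-join in Java, under the assumptions that the corpus is sorted by subject and the pruned owl:sameAs index is a sorted file of "nonCanonical<TAB>canonical" lines; the class name and I/O handling are illustrative only.

import java.io.*;

public class SubjectRewriter {
  /**
   * corpus: tab-separated quads sorted by subject;
   * sameAs: "nonCanonical<TAB>canonical" lines sorted by the first column.
   * Writes the corpus with subjects rewritten to their canonical identifiers;
   * object rewriting repeats the same procedure after re-sorting by object.
   */
  public static void rewriteSubjects(BufferedReader corpus, BufferedReader sameAs,
                                     Writer out) throws IOException {
    String mapping = sameAs.readLine();
    String quad;
    while ((quad = corpus.readLine()) != null) {
      String subject = quad.substring(0, quad.indexOf('\t'));
      // advance the sameAs stream until its key is >= the current subject
      while (mapping != null
          && mapping.substring(0, mapping.indexOf('\t')).compareTo(subject) < 0) {
        mapping = sameAs.readLine();
      }
      if (mapping != null && mapping.startsWith(subject + "\t")) {
        String canonical = mapping.substring(mapping.indexOf('\t') + 1);
        out.write(canonical + quad.substring(quad.indexOf('\t')) + "\n");
      } else {
        out.write(quad + "\n");                  // no coreferent identifier: keep as-is
      }
    }
  }
}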

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
<http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs data, we require a higher-level iterative process as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;

(ii) apply the initial closure of the owl:sameAs data;

(iii) then, iteratively until no new owl:sameAs inferences are found:

– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;

– derive new owl:sameAs inferences possible through the previous rewriting, for each consolidation rule;

– re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge.

(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules.

(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines.

(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data.

(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:


Table 8. Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently.

Category                                     1               2               4               8               Full-8
                                             min     %       min     %       min     %       min     %       min     %
Master
  Total execution time                       454.3   100.0   230.2   100.0   127.5   100.0   81.3    100.0   740.4   100.0
  Executing                                  29.9    6.6     30.3    13.1    32.5    25.5    30.1    37.0    243.6   32.9
    Aggregate consolidation-relevant data    4.3     0.9     4.5     2.0     4.7     3.7     4.6     5.7     29.9    4.0
    Closing owl:sameAs *                     24.4    5.4     24.5    10.6    25.4    19.9    24.4    30.0    211.2   28.5
    Miscellaneous                            1.2     0.3     1.3     0.5     2.4     1.9     1.1     1.3     2.5     0.3
  Idle (waiting for slaves)                  424.5   93.4    199.9   86.9    95.0    74.5    51.2    63.0    496.8   67.1
Slave
  Avg. executing (total)                     448.9   98.8    221.2   96.1    111.6   87.5    56.5    69.4    599.1   80.9
    Extract terminology                      44.4    9.8     22.7    9.9     11.8    9.2     6.9     8.4     49.4    6.7
    Extract consolidation-relevant data      95.5    21.0    48.3    21.0    24.3    19.1    13.2    16.2    137.9   18.6
    Initial sort (by subject) *              98.0    21.6    48.3    21.0    23.7    18.6    11.6    14.2    133.8   18.1
    Consolidation                            211.0   46.4    101.9   44.3    51.9    40.7    24.9    30.6    278.0   37.5
  Avg. idle                                  5.5     1.2     8.9     3.9     37.9    29.7    23.6    29.0    141.3   19.1
    Waiting for peers                        0.0     0.0     3.2     1.4     7.1     5.6     6.3     7.7     37.5    5.1
    Waiting for master                       5.5     1.2     5.8     2.5     30.8    24.1    17.3    21.2    103.8   14.0


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine.

– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c).

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed.

(vii) Run: each slave machine then rewrites the subjects of their segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest, due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible—the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time—of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup—in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment, we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507×, 0.554× and 0.638×, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements—this time directly extracted from explicit statements, as opposed to explicit and inferred statements—which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were, in fact, negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9. Top ten most frequently occurring blacklisted values.

    Blacklisted term                                Occurrences
1   empty literals                                  584,735
2   <http://null>                                   414,088
3   <http://www.vox.com/gone>                       150,402
4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"      58,338
5   <http://www.facebook.com>                       6,988
6   <http://facebook.com>                           5,462
7   <http://www.google.com>                         2,234
8   <http://www.facebook.com>                       1,254
9   <http://google.com>                             1,108
10  <http://null.com>                               542

Fig. 5. Distribution of the number of identifiers per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log).


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties, and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted—however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 less equivalence classes when the reasoned triples were included).

From previous experiments with older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters, and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9—this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
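A trivial Java sketch of how such a blacklist might be applied when extracting consolidation-relevant statements (the values shown are a subset of Table 9; the class name and value representation are illustrative):

import java.util.Set;

public class BlacklistFilter {
  // A few of the blacklisted (inverse-)functional values from Table 9.
  private static final Set<String> BLACKLIST = Set.of(
      "",                                             // empty literals
      "http://null",
      "08445a31a78661b5c746feff39a9db6e4e2cc5cf",     // SHA-1 of 'mailto:'
      "http://www.facebook.com",
      "http://www.google.com");

  /** Returns true if a statement with this (inverse-)functional value
   *  should be ignored during consolidation. */
  public static boolean isBlacklisted(String value) {
    return BLACKLIST.contains(value);
  }
}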

During the computation of the owlsameAs closure we foundzero inferences through cardinality rules 1068 thousand rawowl sameAs inferences through functional-property reasoningand 87 million raw owlsameAs inferences through inverse-func-

22 cf httpbio2rdforgnsbio2rdfTopic offline as of 2011062023 Our full blacklist contains forty-one such values and can be found at http

aidanhogancomswseblacklisttxt 24 This site shut down on 20100930

tional-property reasoning The final canonicalised closed and non-symmetric owlsameAs index (such that s1 SA s2 s1 gt s2 and s2 is acanonical identifier) contained 1203 million statements

From this data we generated 282 million equivalence classes(an increase of 131 from baseline consolidation) mentioning atotal of 1486 million terms (an increase of 258 from baselinemdash577 of all URIs and blank-nodes) of which 903 million wereblank-nodes (an increase of 2173 from baselinemdash546 of allblank-nodes) and 583 million were URIs (an increase of 1014from baselinemdash633 of all URIs) Thus we see a large expansionin the amount of blank-nodes consolidated but only minimalexpansion in the set of URIs referenced in the equivalence classesWith respect to the canonical identifiers 641 thousand (227)were blank-nodes and 218 million (773) were URIs

Fig 5 contrasts the equivalence class sizes for the baseline ap-proach (seen previously in Fig 3) and for the extended reasoningapproach Overall there is an observable increase in equivalenceclass sizes where we see the average equivalence class size growto 526 entities (198 baseline) the largest equivalence class sizegrow to 33052 (39 baseline) and the percentage of equivalenceclasses with the minimum size 2 drop to 631 (from 741 inbaseline)

In Table 10 we update the five largest equivalence classesResult 2 carries over from the baseline consolidation The rest ofthe results are largely intra-PLD equivalences where the entity isdescribed using thousands of blank-nodes with a consistent(inverse-)functional property value attached Result 1 refers to ameta-usermdashlabelled Team Voxmdashcommonly appearing in user-FOAFexports on the Vox blogging platform24 Result 3 refers to a personidentified using blank-nodes (and once by URI) in thousands of RDFdocuments resident on the same server Result 4 refers to the ImageBioinformatics Research Group in the University of OxfordmdashlabelledIBRGmdashwhere again it is identified in thousands of documents usingdifferent blank-nodes but a consistent foafhomepage Result 5 issimilar to result 1 but for a Japanese version of the Vox user

We performed an analogous manual inspection of one thousandpairs as per the sampling used previously in Section 44 for thebaseline consolidation results This time we selected SAME 823(823) TRIVIALLY SAME 145 (145) DIFFERENT 23 (23) and UNCLEAR

9 (09) This gives an observed precision for the sample of977 (vs 972 for baseline) After applying our blacklistingthe extended consolidation results are in fact slightly more pre-cise (on average) than the baseline equivalences this is due in part

Table 10Largest five equivalence classes after extended consolidation

Canonical term (lexically lowest in equivalence class) Size Correct

1 bnode37httpa12iggymomvoxcomprofilefoafrdf 33052 U

2 httpbio2rdforgdailymed_drugs1000 84813 httpajftorgrdffoafrdf_me 8140 U

4 bnode4http174129121408080tcmdataassociation100 4395 U

5 bnode1httpaaa977voxcomprofilefoafrdf 1977 U

1

10

100

1000

10000

100000

1e+006

1e+007

1 10 100

num

ber o

f cla

sses

number of PLDs in equivalence class

baseline consolidationreasoning consolidation

Fig 6 Distribution of the number of PLDs per equivalence class for baselineconsolidation and extended reasoning consolidation (loglog)

Table 11Number of equivalent identifiers found for the top-five ranked entities in SWSE withrespect to baseline consolidation (BL) and reasoning consolidation (R)

Canonical Identifier BL R

1 lthttpdblpl3sdeTim_Berners-Leegt 26 502 ltgeniddanbrigt 10 1383 lthttpupdatestatusnetgt 0 04 lthttpwwwldoddscomfoaffoaf-a-maticgt 0 05 lthttpupdatestatusnetuser1acctgt 0 6

1

10

100

1000

10000

100000

1e+006

1e+007

1e+008

1e+009

1 10 100

num

ber o

f ter

ms

number of PLDs mentioning blank-nodeURI term

baseline consolidatedraw data

reasoning consolidated

Fig 7 Distribution of number of PLDs the terms are referenced by for the rawbaseline consolidated and reasoning consolidated data (loglog)

92 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

to a high number of blank-nodes being consolidated correctly fromvery uniform exporters of FOAF data that lsquolsquodilutersquorsquo the incorrectresults25

Fig 6 presents a similar analysis to Fig 5 this time looking atidentifiers on a PLD-level granularity Interestingly the differencebetween the two approaches is not so pronounced initially indi-cating that many of the additional equivalences found throughthe consolidation rules are lsquolsquointra-PLDrsquorsquo In the baseline consolida-tion approach we determined that 57 of equivalence classes wereinter-PLD (contain identifiers from more than one PLD) with theplurality of equivalence classes containing identifiers from pre-cisely two PLDs (951thousand 441) this indicates that explicitowlsameAs relations are commonly asserted between PLDs Inthe extended consolidation approach (which of course subsumesthe above results) we determined that the percentage of inter-PLD equivalence classes dropped to 436 with the majority ofequivalence classes containing identifiers from only one PLD(159 million 564) The entity with the most diverse identifiers

25 We make these and later results available for review at httpswsederiorgentity

(the observable outlier on the x-axis in Fig 6) was the personlsquolsquoDan Brickleyrsquorsquomdashone of the founders and leading contributors ofthe FOAF projectmdashwith 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs various other prominent communitymembers and some country identifiers also featured high on thelist

In Table 11 we compare the consolidation of the top five rankedidentifiers in the SWSE system (see [32]) The results refer respec-tively to (i) the (co-)founder of the Web lsquolsquoTim Berners-Leersquorsquo (ii)lsquolsquoDan Brickleyrsquorsquo as aforementioned (iii) a meta-user for the mi-cro-blogging platform StatusNet which exports RDF (iv) thelsquolsquoFOAF-a-maticrsquorsquo FOAF profile generator (linked from many diversedomains hosting FOAF profiles it created) and (v) lsquolsquoEvan Prodro-moursquorsquo founder of the identicaStatusNet micro-blogging serviceand platform We see a significant increase in equivalent identifiersfound for these results however we also noted that after reason-ing consolidation Dan Brickley was conflated with a secondperson26

Note that the most frequently co-occurring PLDs in our equiva-lence classes remained unchanged from Table 7

During the rewrite of the main corpus terms in 15177 millionsubject positions (1358 of all subjects) and 3216 million objectpositions (353 of non-rdftype objects) were rewritten givinga total of 18393 million positions rewritten (18 the baselineconsolidation approach) In Fig 7 we compare the re-use of termsacross PLDs before consolidation after baseline consolidation andafter the extended reasoning consolidation Again although thereis an increase in re-use of identifiers across PLDs we note that

6 Domenico Gendarmi with three URIsmdashone document assigns one of Danrsquosoafmbox_sha1sum values (for danbriw3org) to Domenico httpfoafbuild-rqdoscompeoplemyriamleggieriwordpresscomfoafrdf

2

f

e

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 93

(i) the vast majority of identifiers (about 99) still only appear inone PLD (ii) the difference between the baseline and extended rea-soning approach is not so pronounced The most widely referencedconsolidated entitymdashin terms of unique PLDsmdashwas lsquolsquoEvan Prodro-moursquorsquo as aformentioned referenced with six equivalent URIs in101 distinct PLDs

In summary we conclude that applying the consolidation rulesdirectly (without more general reasoning) is currently a goodapproximation for Linked Data and that in comparison to the base-line consolidation over explicit owlsameAs (i) the additional con-solidation rules generate a large bulk of intra-PLD equivalences forblank-nodes27 (ii) relatedly there is only a minor expansion(1014) in the number of URIs involved in the consolidation (iii)with blacklisting the overall precision remains roughly stable

6 Statistical concurrence analysis

Having looked extensively at the subject of consolidating LinkedData entities in this section we now introduce methods for deriv-ing a weighted concurrence score between entities in the LinkedData corpus we define entity concurrence as the sharing of out-links inlinks and attribute values denoting a specific form of sim-ilarity Conceptually concurrence generates scores between twoentities based on their shared inlinks outlinks or attribute valuesMore lsquolsquodiscriminatingrsquorsquo shared characteristics lend a higher concur-rence score for example two entities based in the same village aremore concurrent than two entities based in the same country Howdiscriminating a certain characteristic is is determined throughstatistical selectivity analysis The more characteristics two entitiesshare (typically) the higher their concurrence score will be

We use these concurrence measures to materialise new linksbetween related entities thus increasing the interconnectednessof the underlying corpus We also leverage these concurrence mea-sures in Section 7 for disambiguating entities where after identi-fying that an equivalence class is likely erroneous and causinginconsistency we use concurrence scores to realign the most sim-ilar entities

In fact we initially investigated concurrence as a speculativemeans of increasing the recall of consolidation for Linked Datathe core premise of this investigation was that very similar enti-tiesmdashwith higher concurrence scoresmdashare likely to be coreferentAlong these lines the methods described herein are based on pre-liminary works [34] where we

ndash investigated domain-agnostic statistical methods for perform-ing consolidation and identifying equivalent entities

ndash formulated an initial small-scale (56 million triples) evaluationcorpus for the statistical consolidation using reasoning consoli-dation as a best-effort lsquolsquogold-standardrsquorsquo

Our evaluation gave mixed results where we found some corre-lation between the reasoning consolidation and the statisticalmethods but we also found that our methods gave incorrect re-sults at high degrees of confidence for entities that were clearlynot equivalent but shared many links and attribute values in com-mon This highlights a crucial fallacy in our speculative approachin almost all cases even the highest degree of similarityconcur-rence does not necessarily indicate equivalence or co-reference(Section 44 cf [27]) Similar philosophical issues arise with respectto handling transitivity for the weighted lsquolsquoequivalencesrsquorsquo derived[1639]

27 We believe this to be due to FOAF publishing practices whereby a given exporteruses consistent inverse-functional property values instead of URIs to uniquelyidentify entities across local documents

However deriving weighted concurrence measures has applica-tions other than approximative consolidation in particular we canmaterialise named relationships between entities that share a lotin common thus increasing the level of inter-linkage between enti-ties in the corpus Again as we will see later we can leverage theconcurrence metrics to repair equivalence classes found to be erro-neous during the disambiguation step of Section 7 Thus we pres-ent a modified version of the statistical analysis presented in [34]describe a (novel) scalable and distributed implementation thereofand finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus

61 High-level approach

Our statistical concurrence analysis inherits similar primaryrequirements to that imposed for consolidation the approachshould be scalable fully automatic and domain agnostic to beapplicable in our scenario Similarly with respect to secondary cri-teria the approach should be efficient to compute should givehigh precision and should give high recall Compared to consoli-dation high precision is not as critical for our statistical use-casein SWSE we aim to use concurrence measures as a means of sug-gesting additional navigation steps for users browsing the enti-tiesmdashif the suggestion is uninteresting it can be ignored whereasincorrect consolidation will often lead to conspicuously garbled re-sults aggregating data on multiple disparate entities

Thus our requirements (particularly for scale) preclude the pos-sibility of complex analyses or any form of pair-wise comparisonetc Instead we aim to design lightweight methods implementableby means of distributed sorts and scans over the corpus Our meth-ods are designed around the following intuitions and assumptions

(i) the concurrence of entities is measured as a function of theirshared pairs be they predicatendashsubject (loosely inlinks) orpredicatendashobject pairs (loosely outlinks or attribute values)

(ii) the concurrence measure should give a higher weight toexclusive shared-pairsmdashpairs that are typically shared byfew entities for edges (predicates) that typically have alow in-degreeout-degree

(iii) with the possible exception of correlated pairs each addi-tional shared pair should increase the concurrence of the enti-tiesmdasha shared pair cannot reduce the measured concurrenceof the sharing entities

(iv) strongly exclusive property-pairs should be more influentialthan a large set of weakly exclusive pairs

(v) correlation may exist between shared pairsmdasheg two entitiesmay share an inlink and an inverse-outlink to the same node(eg foafdepiction foafdepicts) or may share alarge number of shared pairs for a given property (eg twoentities co-authoring one paper are more likely to co-authorsubsequent papers)mdashwhere we wish to dampen the cumula-tive effect of correlation in the concurrence analysis

(vi) the relative value of the concurrence measure is importantthe absolute value is unimportant

In fact the concurrence analysis follows a similar principle tothat for consolidation where instead of considering discrete func-tional and inverse-functional properties as given by the semanticsof the data we attempt to identify properties that are quasi-func-tional quasi-inverse-functional or what we more generally termexclusive we determine the degree to which the values of proper-ties (here abstracting directionality) are unique to an entity or setof entities The concurrence between two entities then becomes anaggregation of the weights for the propertyndashvalue pairs they sharein common Consider the following running-example

94 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

dblpAliceB10 foafmaker exAlice

dblpAliceB10 foafmaker exBob

exAlice foafgender lsquolsquofemalersquorsquoexAlice foafworkplaceHomepage lthttpwonderlandcomgt

exBob foafgender lsquolsquomalersquorsquoexBob foafworkplaceHomepage lthttpwonderlandcomgt

exClaire foafgender lsquolsquofemalersquorsquoexClaire foafworkplaceHomepage lthttpwonderlandcomgt

where we want to determine the level of (relative) concurrence be-tween three colleagues exAlice exBob and exClaire iehow much do they coincideconcur with respect to exclusive sharedpairs

611 Quantifying concurrenceFirst we want to characterise the uniqueness of properties

thus we analyse their observed cardinality and inverse-cardinalityas found in the corpus (in contrast to their defined cardinality aspossibly given by the formal semantics)

Definition 1 (Observed cardinality) Let G be an RDF graph p be aproperty used as a predicate in G and s be a subject in G Theobserved cardinality (or henceforth in this section simply cardinal-ity) of p wrt s in G denoted CardG(ps) is the cardinality of the seto 2 Cj(spo) 2 G

Definition 2 (Observed inverse-cardinality) Let G and p be asbefore and let o be an object in G The observed inverse-cardinality(or henceforth in this section simply inverse-cardinality) of p wrt oin G denoted ICardG(po) is the cardinality of the sets 2 U [ Bj(spo) 2 G

Thus loosely the observed cardinality of a property-subjectpair is the number of unique objects it appears within the graph(or unique triples it appears in) letting Gex denote our examplegraph then eg CardGex (foafmakerdblpAliceB10) = 2 Wesee this value as a good indicator of how exclusive (or selective) agiven property-subject pair is where sets of entities appearing inthe object position of low-cardinality pairs are considered to con-cur more than those appearing with high-cardinality pairs The ob-served inverse-cardinality of a property-object pair is the numberof unique subjects it appears with in the graphmdasheg ICardGex (foafgenderfemale) = 2 Both directions are considered analogousfor deriving concurrence scoresmdashnote however that we do not con-sider concurrence for literals (ie we do not derive concurrence forliterals that share a given predicatendashsubject pair we do of courseconsider concurrence for subjects with the same literal value fora given predicate)

To avoid unnecessary duplication we henceforth focus ondescribing only the inverse-cardinality statistics of a propertywhere the analogous metrics for plain-cardinality can be derivedby switching subject and object (that is switching directional-ity)mdashwe choose the inverse direction as perhaps being more intu-itive indicating concurrence of entities in the subject positionbased on the predicatendashobject pairs they share

Definition 3 (Average inverse-cardinality) Let G be an RDF graphand p be a property used as a predicate in G The average inverse-cardinality (AIC) of p with respect to G written AICG(p) is theaverage of the non-zero inverse-cardinalities of p in the graph GFormally

AICGethpTHORN frac14j feths oTHORN j eths p oTHORN 2 Gg jj fo j 9s ethsp oTHORN 2 Gg j

The average cardinality ACG(p) of a property p is defined analo-

gously as for AICG(p) Note that the (inverse-)cardinality value ofany term appearing as a predicate in the graph is necessarily great-er-than or equal-to one the numerator is by definition greater-than or equal-to the denominator Taking an example AICGex (foafgender)=15 which can be viewed equivalently as the averagenon-zero cardinalities of foafgender (1 for male and 2 forfemale) or the number of triples with predicate foaf genderdivided by the number of unique values appearing in the object po-sition of such triples (3

2)We call a property p for which we observe AICG(p) 1 a quasi-

inverse-functional property with respect to the graph G and analo-gously properties for which we observe ACG(p) 1 as quasi-func-tional properties We see the values of such propertiesmdashin theirrespective directionsmdashas being very exceptional very rarelyshared by entities Thus we would expect a property such as foafgender to have a high AICG(p) since there are only two object-val-ues (male female) shared by a large number of entitieswhereas we would expect a property such as foafworkplace-

Homepage to have a lower AICG(p) since there are arbitrarily manyvalues to be shared amongst the entities given such an observa-tion we then surmise that a shared foafgender value representsa weaker lsquolsquoindicatorrsquorsquo of concurrence than a shared value forfoafworkplaceHomepage

Given that we deal with incomplete information under the OpenWorld Assumption underlying RDF(S)OWL we also wish to weighthe average (inverse) cardinality values for properties with a lownumber of observations towards a global meanmdashconsider a fictional property exmaritalStatus for which we onlyencounter a few predicate-usages in a given graph and consider twoentities given the value married given sparse inverse-cardinal-ity observations we may naiumlvely over-estimate the significance ofthis property-object pair as an indicator for concurrence Thus weuse a credibility formula as follows to weight properties with fewobservations towards a global mean as follows

Definition 4 (Adjusted average inverse-cardinality) Let p be aproperty appearing as a predicate in the graph G The adjustedaverage inverse-cardinality (AAIC) of p with respect to G writtenAAICG(p) is then

AAICGethpTHORN frac14AICGethpTHORN j G p j thornAICG j G j

j G p j thorn j G jeth1THORN

where jG pj is the number of distinct objects that appear in a triplewith p as a predicate (the denominator of Definition 3) AICG is theaverage inverse-cardinality for all predicatendashobject pairs (formallyAICG frac14 jGj

jfethpoTHORNj9sethspoTHORN2Ggj) and jGj is the average number of distinct

objects for all predicates in the graph (formally j G j frac14jfethpoTHORNj9sethspoTHORN2Ggjjfpj9s9oethspoTHORN2GgjTHORN

Again the adjusted average cardinality AACG(p) of a property pis defined analogously as for AAICG(p) Some reduction of Eq (1) is

possible if one considers that AICGethpTHORN jG pj frac14 jfeths oTHORNjeths p oTHORN2 Ggj also denotes the number of triples for which p appears as a

predicate in graph G and that AICG j Gj frac14 jGjjfpj9s9oethspoTHORN2Ggj also de-

notes the average number of triples per predicate We maintain Eq

swpersonstefan-decker

swpersonandreas-harth

swpaper140

swpaper2221

swpaper3403

dbpediaIreland

sworgnui-galway

sworgderi-nui-galway

foafmade

swrcaffiliaton

foafbased_near

dccreator swrcauthorfoafmaker

foafmade

dccreatorswrcauthorfoafmaker

foafmadedccreator

swrcauthorfoafmaker

foafmember

foafmember

swrcaffiliaton

Fig 8 Example of same-value inter-property and intra-property correlation where the two entities under comparison are highlighted in the dashed box and where thelabels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34])

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 95

(1) in the given unreduced form as it more clearly corresponds tothe structure of a standard credibility formula the reading(AICG(p)) is dampened towards a mean (AICG) by a factor deter-mined by the size of the sample used to derive the reading

(jG pj) relative to the average sample size (jGj)Now we move towards combining these metrics to determine

the concurrence of entities who share a given non-empty set ofpropertyndashvalue pairs To do so we combine the adjusted average(inverse) cardinality values which apply generically to propertiesand the (inverse) cardinality values which apply to a given prop-ertyndashvalue pair For example take the property foafworkplace-Homepage entities that share a value referential to a largecompanymdasheg httpgooglecommdashshould not gain as muchconcurrence as entities that share a value referential to a smallercompanymdasheg httpderiie Conversely consider a fictionalproperty excitizenOf which relates a citizen to its country andfor which we find many observations in our corpus returning ahigh AAIC value and consider that only two entities share the va-lue exVanuatu for this property given that our data are incom-plete we can use the high AAIC value of excitizenOf todetermine that the property is usually not exclusive and that itis generally not a good indicator of concurrence28

We start by assigning a coefficient to each pair (po) and eachpair (ps) that occur in the dataset where the coefficient is an indi-cator of how exclusive that pair is

Definition 5 (Concurrence coefficients) The concurrence-coefficientof a predicatendashsubject pair (ps) with respect to a graph G is given as

CGethp sTHORN frac141

CardGethp sTHORN AACGethpTHORN

and the concurrence-coefficient of a predicatendashobject pair (po) withrespect to a graph G is analogously given as

ICGethp oTHORN frac141

ICardGethp oTHORN AAICGethpTHORN

Again please note that these coefficients fall into the interval]01] since the denominator by definition is necessarily greaterthan one

To take an example let pwh = foafworkplaceHomepage andsay that we compute AAICG(pwh)=7 from a large number of obser-

28 Here we try to distinguish between propertyndashvalue pairs that are exclusive inreality (ie on the level of whatrsquos signified) and those that are exclusive in the givengraph Admittedly one could think of counter-examples where not including thegeneral statistics of the property may yield a better indication of weightedconcurrence particularly for generic properties that can be applied in many contextsfor example consider the exclusive predicatendashobject pair (skos subject cate-gory Koenigsegg_vehicles) given for a non-exclusive property

vations indicating that each workplace homepage in the graph G islinked to by on average seven employees Further letog = httpgooglecom and assume that og occurs 2000 timesas a value for pwh ICardG(pwhog) = 2000 now ICGethpwh ogTHORN frac14

120007 frac14 000007 Also let od = httpderiie such thatICardG(pwhod)=100 now ICGethpwh odTHORN frac14 1

107 000143 Heresharing DERI as a workplace will indicate a higher level of concur-rence than analogously sharing Google

Finally we require some means of aggregating the coefficientsof the set of pairs that two entities share to derive the final concur-rence measure

Definition 6 (Aggregated concurrence score) Let Z = (z1 zn) be atuple such that for each i = 1 n zi2]01] The aggregatedconcurrence value ACSn is computed iteratively starting withACS0 = 0 then for each k = 1 n ACSk = zk + ACSk1 zkfraslACSk1

The computation of the ACS value is the same process as deter-mining the probability of two independent events occurringmdashP(A _ B) = P(A) + P(B) P(AfraslB)mdashwhich is by definition commuta-tive and associative and thus computation is independent of theorder of the elements in the tuple It may be more accurate to viewthe coefficients as fuzzy values and the aggregation function as adisjunctive combination in some extensions of fuzzy logic [75]

However the underlying coefficients may not be derived fromstrictly independent phenomena there may indeed be correlationbetween the propertyndashvalue pairs that two entities share To illus-trate we reintroduce a relevant example from [34] shown in Fig 8where we see two researchers that have co-authored many paperstogether have the same affiliation and are based in the samecountry

This example illustrates three categories of concurrencecorrelation

(i) same-value correlation where two entities may be linked to thesame value by multiple predicates in either direction (egfoaf made dccreator swrcauthor foafmaker)

(ii) intra-property correlation where two entities that share agiven propertyndashvalue pair are likely to share further valuesfor the same property (eg co-authors sharing one valuefor foafmade are more likely to share further values)

(iii) inter-property correlation where two entities sharing a givenpropertyndashvalue pair are likely to share further distinct butrelated propertyndashvalue pairs (eg having the same valuefor swrcaffiliation and foafbased_near)

Ideally we would like to reflect such correlation in the compu-tation of the concurrence between the two entities

Regarding same-value correlation for a value with multipleedges shared between two entities we choose the shared predicate

96 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

edge with the lowest AA[I]C value and disregard the other edgesie we only consider the most exclusive property used by bothentities to link to the given value and prune the other edges

Regarding intra-property correlation we apply a lower-levelaggregation for each predicate in the set of shared predicate-valuepairs Instead of aggregating a single tuple of coefficients we gen-erate a bag of tuples Z frac14 fZp1 Zpn

g where each element Zpi

represents the tuple of (non-pruned) coefficients generated forthe predicate pi29 We then aggregate this bag as follows

ACSethZTHORN frac14 ACSethACSethZpi AAfrac12ICethpiTHORNTHORNZpi

2ZTHORN

where AA[I]C is either AAC or AAIC dependant on the directionalityof the predicate-value pair observed Thus the total contributionpossible through a given predicate (eg foafmade) has anupper-bound set as its AA[I]C value where each successive sharedvalue for that predicate (eg each successive co-authored paper)contributes positively (but increasingly less) to the overall concur-rence measure

Detecting and counteracting inter-property correlation is per-haps more difficult and we leave this as an open question

612 Implementing entity-concurrence analysisWe aim to implement the above methods using sorts and scans

and we wish to avoid any form of complex indexing or pair-wisecomparison First we wish to extract the statistics relating to the(inverse-)cardinalities of the predicates in the data Given that thereare 23 thousand unique predicates found in the input corpus weassume that we can fit the list of predicates and their associatedstatistics in memorymdashif such were not the case one could consideran on-disk map where we would expect a high cache hit-rate basedon the distribution of property occurrences in the data (cf [32])

Moving forward we can calculate the necessary predicate-levelstatistics by first sorting the data according to natural order(spoc) and then scanning the data computing the cardinality(number of distinct objects) for each (sp) pair and maintainingthe average cardinality for each p found For inverse-cardinalityscores we apply the same process sorting instead by (opsc) or-der counting the number of distinct subjects for each (po) pairand maintaining the average inverse-cardinality scores for eachp After each scan the statistics of the properties are adjustedaccording to the credibility formula in Eq (4)

We then apply a second scan of the sorted corpus first we scanthe data sorted in natural order and for each (sp) pair for each setof unique objects Ops found thereon and for each pair in

fethoi ojTHORN 2 U [ B U [ B j oi oj 2 Ops oi lt ojg

where lt denotes lexicographical order we output the following sex-tuple to an on-disk file

ethoi ojCethp sTHORNp sTHORN

where Cethp sTHORN frac14 1jOps jAACethpTHORN We apply the same process for the other

direction for each (po) pair for each set of unique subjects Spo andfor each pair in

fethsi sjTHORN 2 U [ B U [ B j si sj 2 Spo si lt sjg

we output analogous sextuples of the form

ethsi sj ICethp oTHORN p othornTHORN

We call the sets Ops and their analogues Spo concurrence classesdenoting sets of entities that share the given predicatendashsubjectpredicatendashobject pair respectively Here note that the lsquo + rsquo andlsquo rsquo elements simply mark and track the directionality from whichthe tuple was generated required for the final aggregation of the

29 For brevity we omit the graph subscript

co-efficient scores Similarly we do not immediately materialisethe symmetric concurrence scores where we instead do so at theend so as to forego duplication of intermediary processing

Once generated we can sort the two files of tuples by their nat-ural order and perform a merge-join on the first two elementsmdashgeneralising the directional oisi to simply ei each (eiej) pair de-notes two entities that share some predicate-value pairs in com-mon where we can scan the sorted data and aggregate the finalconcurrence measure for each (eiej) pair using the information en-coded in the respective tuples We can thus generate (triviallysorted) tuples of the form (eiejs) where s denotes the final aggre-gated concurrence score computed for the two entities optionallywe can also write the symmetric concurrence tuples (ejeis) whichcan be sorted separately as required

Note that the number of tuples generated is quadratic with re-spect to the size of the respective concurrence class which be-comes a major impediment for scalability given the presence oflarge such setsmdashfor example consider a corpus containing 1 mil-lion persons sharing the value female for the property foaf

gender where we would have to generate 1062106

2 500 billionnon-reflexive non-symmetric concurrence tuples However wecan leverage the fact that such sets can only invoke a minor influ-ence on the final concurrence of their elements given that themagnitude of the setmdasheg jSpojmdashis a factor in the denominator ofthe computed C(po) score such that Cethp oTHORN 1

jSop j Thus in practice

we implement a maximum-size threshold for the Spo and Ops con-currence classes this threshold is selected based on a practicalupper limit for raw similarity tuples to be generated where theappropriate maximum class size can trivially be determined along-side the derivation of the predicate statistics For the purpose ofevaluation we choose to keep the number of raw tuples generatedat around 1 billion and so set the maximum concurrence classsize at 38mdashwe will see more in Section 64

62 Distributed implementation

Given the previous discussion our distributed implementationis fairly straight-forward as follows

(i) Coordinate the slave machines split their segment of thecorpus according to a modulo-hash function on the subjectposition of the data sort the segments and send the splitsegments to the peer determined by the hash-function theslaves simultaneously gather incoming sorted segmentsand subsequently perform a merge-sort of the segments

(ii) Coordinate the slave machines apply the same operationthis time hashing on objectmdashtriples with rdf type as pred-icate are not included in the object-hashing subsequentlythe slaves merge-sort the segments ordered by object

(iii) Run the slave machines then extract predicate-level statis-tics and statistics relating to the concurrence-class-size dis-tribution which are used to decide upon the class sizethreshold

(iv) Gatherfloodrun the master machine gathers and aggre-gates the high-level statistics generated by the slavemachines in the previous step and sends a copy of the globalstatistics back to each machine the slaves subsequentlygenerate the raw concurrence-encoding sextuples asdescribed before from a scan of the data in both orders

(v) Coordinate the slave machines coordinate the locally gen-erated sextuples according to the first element (join posi-tion) as before

(vi) Run the slave machines aggregate the sextuples coordi-nated in the previous step and produce the final non-sym-metric concurrence tuples

Table 12Breakdown of timing of distributed concurrence analysis

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 5162 1000 2627 1000 1378 1000 730 1000 8354 1000Executing 22 04 23 09 31 23 34 46 21 03

Miscellaneous 22 04 23 09 31 23 34 46 21 03Idle (waiting for slaves) 5140 996 2604 991 1346 977 696 954 8333 997

SlaveAvg executing (total exc idle) 5140 996 2578 981 1311 951 664 910 7866 942

Splitsortscatter (subject) 1196 232 593 226 299 217 149 204 1759 211Merge-sort (subject) 421 82 207 79 112 81 57 79 624 75

Splitsortscatter (object) 1154 224 565 215 288 209 143 196 1640 196Merge-sort (object) 372 72 187 71 96 70 51 70 556 66Extract high-level statistics 208 40 101 38 51 37 27 37 284 33Generate raw concurrence tuples 454 88 233 89 116 84 59 80 677 81Cooordinatesort concurrence tuples 976 189 501 191 250 181 125 172 1660 199Merge-sortaggregate similarity 360 70 192 73 100 72 53 72 666 80

Avg idle 22 04 49 19 67 49 65 90 488 58Waiting for peers 00 00 26 10 36 26 32 43 467 56Waiting for master 22 04 23 09 31 23 34 46 21 03

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 97

(vii) Run the slave machines produce the symmetric version ofthe concurrence tuples and coordinate and sort on the firstelement

Here we make heavy use of the coordinate function to aligndata according to the join position required for the subsequentprocessing stepmdashin particular aligning the raw data by subjectand object and then the concurrence tuples analogously

Note that we do not hash on the object position of rdf typetriples our raw corpus contains 2068 million such triples and gi-ven the distribution of class memberships we assume that hashingthese values will lead to uneven distribution of data and subse-quently uneven load balancingmdasheg 792 of all class member-ships are for foafPerson hashing on which would send 1637million triples to one machine which alone is greater than theaverage number of triples we would expect per machine (1398million) In any case given that our corpus contains 105 thousandunique values for rdf type we would expect the average-in-verse-cardinality to be approximately 1970mdasheven for classes withtwo members the potential effect on concurrence is negligible

63 Performance evaluation

We apply our concurrence analysis over the consolidated cor-pus derived in Section 5 The total time taken was 139 h Sortingsplitting and scattering the data according to subject on the slavemachines took 306 h with an average idle time of 77 min(42) Subsequently merge-sorting the sorted segments took113 h with an average idle time of 54 min (8) Analogously sort-ing splitting and scattering the non-rdftype statements by ob-ject took 293 h with an average idle time of 118 min (67)Merge sorting the data by object took 099 h with an average idletime of 38 min (63) Extracting the predicate statistics andthreshold information from data sorted in both orders took29 min with an average idle time of 06 min (21) Generatingthe raw unsorted similarity tuples took 698 min with an averageidle time of 21 min (3) Sorting and coordinating the raw similar-ity tuples across the machines took 1801 min with an average idletime of 141 min (78) Aggregating the final similarity took678 min with an average idle time of 12 min (18)

Table 12 presents a breakdown of the timing of the task Firstwith regards application over the full corpus although this task re-quires some aggregation of global-knowledge by the master

machine the volume of data involved is minimal a total of 21minutes is spent on the master machine performing various minortasks (initialisation remote calls logging aggregation and broad-cast of statistics) Thus 997 of the task is performed in parallelon the slave machine Although there is less time spent waitingfor the master machine compared to the previous two tasks deriv-ing the concurrence measures involves three expensive sortcoor-dinatemerge-sort operations to redistribute and sort the data overthe slave swarm The slave machines were idle for on average 58of the total task time most of this idle time (996) was spentwaiting for peers The master machine was idle for almost the en-tire task with 997 waiting for the slave machines to finish theirtasksmdashagain interleaving a job for another task would have practi-cal benefits

Second with regards the timing of tasks when varying the num-ber of slave machines for 100 million quadruples we see that thepercentage of time spent idle on the slave machines increases from04 (1) to 9 (8) However for each incremental doubling of thenumber of slave machines the total overall task times are reducedby 0509 0525 and 0517 respectively

64 Results evaluation

With respect to data distribution after hashing on subject weobserved an average absolute deviation (average distance fromthe mean) of 176 thousand triples across the slave machines rep-resenting an average 013 deviation from the mean near-optimaldata distribution After hashing on the object of non-rdftype tri-ples we observed an average absolute deviation of 129 million tri-ples across the machines representing an average 11 deviationfrom the mean in particular we note that one machine was as-signed 37 million triples above the mean (an additional 33 abovethe mean) Although not optimal the percentage of data deviationgiven by hashing on object is still within the natural variation inrun-times we have seen for the slave machines during most paral-lel tasks

In Fig 9a and b we illustrate the effect of including increasinglylarge concurrence classes on the number of raw concurrence tuplesgenerated For the predicatendashobject pairs we observe a powerndashlaw(-esque) relationship between the size of the concurrence classand the number of such classes observed Second we observe thatthe number of concurrences generated for each increasing classsize initially remains fairly staticmdashie larger class sizes give

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(a) Predicate-object pairs

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(b) Predicate-subject pairs

Fig 9 Breakdown of potential concurrence classes given with respect to predicatendashobject pairs and predicatendashsubject pairs respectively where for each class size on the x-axis (si) we show the number of classes (ci) the raw non-reflexive non-symmetric concurrence tuples generated (spi frac14 ci

s2ithornsi

2 ) a cumulative count of tuples generated forincreasing class sizes (cspi frac14

Pj6ispj) and our cut-off (si = 38) that we have chosen to keep the total number of tuples at 1 billion (loglog)

Table 13Top five predicates with respect to lowest adjusted average cardinality (AAC)

Predicate Objects Triples AAC

1 foafnick 150433035 150437864 10002 lldpubmedjournal 6790426 6795285 10033 rdfvalue 2802998 2803021 10054 eurostatgeo 2642034 2642034 10055 eurostattime 2641532 2641532 1005

Table 14Top five predicates with respect to lowest adjusted average inverse-cardinality(AAIC)

Predicate Subjects Triples AAIC

1 lldpubmedmeshHeading 2121384 2121384 10092 opiumfieldrecommendation 1920992 1920992 10103 fbtypeobjectkey 1108187 1108187 10174 foafpage 1702704 1712970 10175 skipinionshasFeature 1010604 1010604 1019

98 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

quadratically more concurrences but occur polynomially less of-tenmdashuntil the point where the largest classes that generally onlyhave one occurrence is reached and the number of concurrencesbegins to increase quadratically Also shown is the cumulativecount of concurrence tuples generated for increasing class sizeswhere we initially see a powerndashlaw(-esque) correlation whichsubsequently begins to flatten as the larger concurrence classes be-come more sparse (although more massive)

For the predicatendashsubject pairs the same roughly holds truealthough we see fewer of the very largest concurrence classesthe largest concurrence class given by a predicatendashsubject pairwas 79 thousand versus 19 million for the largest predicatendashob-ject pair respectively given by the pairs (kwamap macsman-ual_rameau_lcsh) and (opiumfieldrating ) Also weobserve some lsquolsquonoisersquorsquo where for milestone concurrence class sizes(esp at 50 100 1000 2000) we observe an unusual amount ofclasses For example there were 72 thousand concurrence classesof precisely size 1000 (versus 88 concurrence classes at size996)mdashthe 1000 limit was due to a FOAF exporter from the hi5comdomain which seemingly enforces that limit on the total lsquolsquofriendscountrsquorsquo of users translating into many users with precisely 1000values for foafknows30 Also for example there were 55 thousandclasses of size 2000 (versus 6 classes of size 1999)mdashalmost all ofthese were due to an exporter from the bio2rdforg domain whichputs this limit on values for the b2rlinkedToFrom property31 Wealso encountered unusually large numbers of classes approximatingthese milestones such as 73 at 2001 Such phenomena explain thestaggered lsquolsquospikesrsquorsquoand lsquolsquodiscontinuitiesrsquorsquo in Fig 9b which can be ob-served to correlate with such milestone values (in fact similar butless noticeable spikes are also present in Fig 9a)

These figures allow us to choose a threshold of concurrence-class size given an upper bound on raw concurrence tuples togenerate For the purposes of evaluation we choose to keep thenumber of materialised concurrence tuples at around 1 billionwhich limits our maximum concurrence class size to 38 (fromwhich we produce 1024 billion tuples 721 million through shared(po) pairs and 303 million through (ps) pairs)

With respect to the statistics of predicates for the predicatendashsubject pairs each predicate had an average of 25229 unique ob-jects for 37953 total triples giving an average cardinality of15 We give the five predicates observed to have the lowest ad-

30 cf httpapihi5comrestprofilefoaf10061469731 cf httpbio2rdforgmeshD000123Q000235

justed average cardinality in Table 13 These predicates are judgedby the algorithm to be the most selective for identifying their sub-ject with a given object value (ie quasi-inverse-functional) notethat the bottom two predicates will not generate any concurrencescores since they are perfectly unique to a given object (ie thenumber of triples equals the number of objects such that no twoentities can share a value for this predicate) For the predicatendashob-ject pairs there was an average of 11572 subjects for 20532 tri-ples giving an average inverse-cardinality of 264 Weanalogously give the five predicates observed to have the lowestadjusted average inverse cardinality in Table 14 These predicatesare judged to be the most selective for identifying their object witha given subject value (ie quasi-functional) however four of thesepredicates will not generate any concurrences since they are per-fectly unique to a given subject (ie those where the number of tri-ples equals the number of subjects)

Aggregation produced a final total of 6369 million weightedconcurrence pairs with a mean concurrence weight of 00159Of these pairs 195 million involved a pair of identifiers from dif-ferent PLDs (31) whereas 6174 million involved identifiers fromthe same PLD however the average concurrence value for an in-tra-PLD pair was 0446 versus 0002 for inter-PLD pairsmdashalthough

Table 15Top five concurrent entities and the number of pairs they share

Entity label 1 Entity label 2 Concur

1 New York City New York State 7912 London England 8943 Tokyo Japan 9004 Toronto Ontario 4185 Philadelphia Pennsylvania 217

Table 16Breakdown of concurrences for top five ranked entities in SWSE ordered by rankwith respectively entity label number of concurrent entities found the label of theconcurrent entity with the largest degree and finally the degree value

Ranked entity Con lsquolsquoClosestrsquorsquo entity Val

1 Tim Berners-Lee 908 Lalana Kagal 0832 Dan Brickley 2552 Libby Miller 0943 updatestatusnet 11 socialnetworkro 0454 FOAF-a-matic 21 foafme 0235 Evan Prodromou 3367 Stav Prodromou 089

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 99

fewer intra-PLD concurrences are found they typically have higherconcurrences32

In Table 15 we give the labels of top five most concurrent enti-ties including the number of pairs they sharemdashthe concurrencescore for each of these pairs was gt09999999 We note that theyare all locations where particularly on WIKIPEDIA (and thus filteringthrough to DBpedia) properties with location values are typicallyduplicated (eg dbpdeathPlace dbpbirthPlace dbphead-quartersmdashproperties that are quasi-functional) for exampleNew York City and New York State are both the dbpdeath-

Place of dbpediaIsacc_Asimov etcIn Table 16 we give a description of the concurrent entities

found for the top-five ranked entitiesmdashfor brevity again we showentity labels In particular we note that a large amount of concur-rent entities are identified for the highly-ranked persons With re-spect to the strongest concurrences (i) Tim and his former studentLalana share twelve primarily academic links coauthoring six pa-pers (ii) Dan and Libby co-founders of the FOAF project share87 links primarily 73 foafknows relations to and from the samepeople as well as a co-authored paper occupying the same profes-sional positions etc33 (iii) updatestatusnet and socialnet-

workro share a single foafaccountServiceHomepage linkfrom a common user (iv) similarly the FOAF-a-matic and foafme

services share a single mvcbgeneratorAgent inlink (v) finallyEvan and Stav share 69 foafknows inlinks and outlinks exportedfrom the identica service

34 We instead refer the interested reader to [9] for some previous works on the topic

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data and thus this is anapproximative reading for the purposes of illustration we extract the PLDs fromcanonical identifiers which are choosen based on arbitrary lexical ordering

33 Notably Leigh Dodds (creator of the FOAF-a-matic service) is linked by theproperty quaffing drankBeerWith to both

of general inconsistency repair for Linked Data35 We use syntactic shortcuts in our file to denote when s = s0 andor o = o0

Maintaining the additional rewrite information during the consolidation process istrivial where the output of consolidating subjects gives quintuples (spocs0) whichare then sorted and consolidated by o to produce the given sextuples

36 httpmodelsokkamorgENS-core-vocabularycountry_of_residence37 httpontologydesignpatternsorgcpowlfsdas

RDF data representing formal contradictions according to OWLsemantics Where such contradictions are created due to consoli-dation we believe this to be a good indicator of erroneousconsolidation

Once erroneous equivalences have been detected we wouldlike to subsequently diagnose and repair the coreferences involvedOne option would be to completely disband the entire equivalenceclass however there may often be only one problematic identifierthat eg causes inconsistency in a large equivalence classmdashbreak-ing up all equivalences would be a coarse solution Instead hereinwe propose a more fine-grained method for repairing equivalenceclasses that incrementally rebuilds coreferences in the set in amanner that preserves consistency and that is based on the originalevidences for the equivalences found as well as concurrence scores(discussed in the previous section) to indicate how similar the ori-ginal entities are Once the erroneous equivalence classes havebeen repaired we can then revise the consolidated corpus to reflectthe changes

71 High-level approach

The high-level approach is to see if the consolidation of anyentities conducted in Section 5 lead to any novel inconsistenciesand subsequently recant the equivalences involved thus it isimportant to note that our aim is not to repair inconsistenciesin the data but instead to repair incorrect consolidation sympto-mised by inconsistency34 In this section we (i) describe whatforms of inconsistency we detect and how we detect them (ii)characterise how inconsistencies can be caused by consolidationusing examples from our corpus where possible (iii) discuss therepair of equivalence classes that have been determined to causeinconsistency

First in order to track which data are consolidated and whichnot in the previous consolidation step we output sextuples ofthe form

ethsp o c s0 o0THORN

where s p o c denote the consolidated quadruple containingcanonical identifiers in the subjectobject position as appropri-ate and s0 and o0 denote the input identifiers prior toconsolidation35

To detect inconsistencies in the consolidated corpus we use theOWL 2 RLRDF rules with the false consequent [24] as listed inTable 4 A quick check of our corpus revealed that one documentprovides eight owlAsymmetricProperty and 10 owlIrre-

flexiveProperty axioms36 and one directory gives 9 owlAll-

DisjointClasses axioms37 and where we found no other OWL2 axioms relevant to the rules in Table 4

We also consider an additional rule whose semantics are indi-rectly axiomatised by the OWL 2 RLRDF rules (through prp-fp dt-diff and eq-diff1) but which we must support directly since we donot consider consolidation of literals

p a owlFunctionalProperty

x p l1 l2

l1 owldifferentFrom l2

)false

100 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

where we underline the terminological pattern To illustrate wetake an example from our corpus

Terminological [httpdbpediaorgdata3

lengthrdf]

dpolength rdftype owlFunctionalProperty

Assertional [httpdbpediaorgdata

Fiat_Nuova_500xml]dbpediaFiat_Nuova_5000 dpolength3546

^^xsddouble

Assertional [httpdbpediaorgdata

Fiat_500xml]dbpediaFiat_Nuova0 dpolength 297

^^xsddouble

where we use the prime symbol [0] to denote identifiers consideredcoreferent by consolidation Here we see two very closely relatedmodels of cars consolidated in the previous step but we now iden-tify that they have two different values for dpolengthmdasha func-tional-propertymdashand thus the consolidation raises an inconsistency

Note that we do not expect the owldifferentFrom assertionto be materialised but instead intend a rather more relaxed seman-tics based on a heurisitic comparison given two (distinct) literalbindings for l1 and l2 we flag an inconsistency iff (i) the datavalues of the two bindings are not equal (standard OWL semantics)and (ii) their lower-case string value (minus language-tags and data-types) are lexically unequal In particular the relaxation is inspiredby the definition of the FOAF (datatype) functional propertiesfoafage foafgender and foafbirthday where the rangeof these properties is rather loosely defined a generic range ofrdfsLiteral is formally defined for these properties with infor-mal recommendations to use malefemale as gender values andMM-DD syntax for birthdays but not giving recommendations fordatatype or language-tags The latter relaxation means that wewould not flag an inconsistency in the following data

3

co

Terminological [httpxmlnscomfoafspecindexrdf]

9 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e f rn=aacute=ndezSergiohtml0 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e a nzuolaSergio_Fern=aacute=ndezhtml1

foafPerson owldisjointWith foafDocument

Assertional [fictional]

exTed foafage 25

exTed foafage lsquolsquo25rsquorsquoexTed foafgender lsquolsquomalersquorsquoexTed foafgender lsquolsquoMalersquorsquoenexTed foafbirthday lsquolsquo25-05rsquorsquo^^xsdgMonthDayexTed foafbirthday lsquolsquo25-05rsquorsquo

With respect to these consistency checking rules we considerthe terminological data to be sound38 We again only consider ter-minological axioms that are authoritatively served by their sourcefor example the following statement

sioc User owldisjointWith foaf Person

would have to be served by a document that either siocUser orfoafPerson dereferences to (either the FOAF or SIOC vocabularysince the axiom applies over a combination of FOAF and SIOC asser-tional data)

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type) and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns – particularly in rule eq-diff1 – are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola"40, who both participated in the ESWC 2006 conference and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document and is inferred to be a foaf:Person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation performs a repair of the equivalence class based on knowledge of the direct inlinks and outlinks available through a simple merge-join, as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers – what we term incompatible identifiers – such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
:made owl:propertyDisjointWith :maker .

Assertional [fictional]:
ex:AZ0′ :maker ex:entcons′ .
dblp:Antoine_Zimmermann0′ :made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp), and (ii) the alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:

(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation, by means of a direct owl:sameAs assertion (in either direction) or a shared inverse-functional object or functional subject – loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.
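For concreteness, one natural reading of this order (our formulation, not verbatim from the text) is: (d1, c1) > (d2, c2) if and only if d1 > d2, or d1 = d2 and c1 > c2.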

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency – i.e., an inconsistency derivable by the alignment of incompatible identifiers – we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej – intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that there is no pair {a, b} ∈ C with a ∈ Ei and b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs – such that neither Ei nor Ej has already been merged in this iteration – producing Eqn+1 at iteration's end;

– iterate over n until a fixpoint is reached.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].
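The following Python sketch illustrates the greedy consistent clustering just described. It is a minimal simplification under our own assumptions: evidence is a hypothetical mapping from unordered identifier pairs to the sets of supporting input triples, concurrence is a hypothetical callable producing the tie-breaking score, and unlinked identifiers are simply left in their own partitions rather than assigned by minimum distance.

from itertools import combinations

def aggregate_weight(Ei, Ej, evidence, concurrence):
    # d: union of the input-triple evidence between any member of Ei and of Ej;
    # c: a concurrence score used only to break ties.
    d = set()
    for a in Ei:
        for b in Ej:
            d |= evidence.get(frozenset((a, b)), set())
    return (len(d), concurrence(Ei, Ej))

def repair(eq_class, incompatible, evidence, concurrence):
    """Split an equivalence class so that no output cluster contains an
    incompatible pair; incompatible is a set of frozenset pairs."""
    clusters = [frozenset([i]) for i in eq_class]   # Eq_0: singleton clusters
    while True:
        candidates = []
        for Ei, Ej in combinations(clusters, 2):
            # skip merges that would re-unite an incompatible pair
            if any(frozenset((a, b)) in incompatible for a in Ei for b in Ej):
                continue
            w = aggregate_weight(Ei, Ej, evidence, concurrence)
            if w[0] > 0 or w[1] > 0:
                candidates.append((w, Ei, Ej))
        if not candidates:
            return clusters                          # fixpoint reached
        candidates.sort(key=lambda t: t[0], reverse=True)  # lexicographic (d, c)
        merged, next_clusters = set(), []
        for _, Ei, Ej in candidates:
            if Ei in merged or Ej in merged:
                continue          # each cluster is merged at most once per round
            merged |= {Ei, Ej}
            next_clusters.append(Ei | Ej)
        next_clusters += [c for c in clusters if c not in merged]
        clusters = next_clusters

Note that Python's tuple comparison already realises the lexicographical order over (d, c) pairs, so the sort prefers stronger triple-level evidence and falls back to concurrence only on ties.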

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks, and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data and batch consolidated segments of data.
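A sketch of this merge-join, under our own simplifying assumptions, is given below: the two inputs are iterables of sextuples sorted by canonical subject and canonical object respectively (the tuple layout follows the text; the function and variable names are illustrative only), and one batch of outlinks/inlinks is emitted per canonical identifier.

from itertools import groupby

def batches(by_subject, by_object):
    """by_subject: (s, p, o, c, s0, o0) tuples sorted by canonical subject s;
    by_object:  (o, p, s, c, o0, s0) tuples sorted by canonical object o.
    Yields (canonical_id, outlinks, inlinks) via a merge-join on s = o."""
    subj_groups = groupby(by_subject, key=lambda t: t[0])
    obj_groups = groupby(by_object, key=lambda t: t[0])
    s_key, s_grp = next(subj_groups, (None, iter(())))
    o_key, o_grp = next(obj_groups, (None, iter(())))
    while s_key is not None or o_key is not None:
        if o_key is None or (s_key is not None and s_key < o_key):
            yield s_key, list(s_grp), []             # entity with outlinks only
            s_key, s_grp = next(subj_groups, (None, iter(())))
        elif s_key is None or o_key < s_key:
            yield o_key, [], list(o_grp)             # entity with inlinks only
            o_key, o_grp = next(obj_groups, (None, iter(())))
        else:                                        # s_key == o_key
            yield s_key, list(s_grp), list(o_grp)
            s_key, s_grp = next(subj_groups, (None, iter(())))
            o_key, o_grp = next(obj_groups, (None, iter(())))

A consumer would then iterate over batches(sorted_by_subject, sorted_by_object) and process each consolidated segment independently, which keeps memory bounded by the largest single segment.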

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently these segments fit in



memory, where for the largest equivalence classes we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and to the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
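The micro-level bookkeeping can be pictured with the following sketch (again an illustrative simplification with assumed names, not the authors' code): each edge carries the set of hashes of the input triples that justify it, the primary weight is the cardinality of that set, and merging two nodes unions the hash-sets of edges that come to coincide.

from collections import defaultdict

class EquivalenceGraph:
    def __init__(self):
        # edge (frozenset of two node ids) -> hashes of supporting input triples
        self.edges = defaultdict(set)

    def add_evidence(self, a, b, triple):
        self.edges[frozenset((a, b))].add(hash(triple))

    def weight(self, a, b):
        # primary weight d: number of distinct input triples supporting the edge
        return len(self.edges.get(frozenset((a, b)), ()))

    def merge(self, a, b, merged_id):
        # redirect all edges incident to a or b onto the merged node,
        # unioning hash-sets for edges that now coincide
        new_edges = defaultdict(set)
        for pair, hashes in self.edges.items():
            nodes = {merged_id if n in (a, b) else n for n in pair}
            if len(nodes) == 2:          # drop self-loops created by the merge
                new_edges[frozenset(nodes)] |= hashes
        self.edges = new_edges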

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward assuming that the slave machines have knowledge of the terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair (columns: number of slave machines; "Full-8" = full corpus on eight machines).

Category                                        1 machine       2             4             8             Full-8
                                                min     %       min    %      min    %      min    %      min     %
Master
  Total execution time                          114.7   100.0   61.3   100.0  36.7   100.0  23.8   100.0  141.1   100.0
  Executing (miscellaneous)                     –       –       –      –      0.1    0.1    0.1    0.1    –       –
  Idle (waiting for slaves)                     114.7   100.0   61.2   99.9   36.6   99.9   23.8   99.9   141.1   100.0
Slave (avg.)
  Executing (total excl. idle)                  114.7   100.0   59.0   96.3   33.6   91.5   22.3   93.7   130.9   92.8
    Identify inconsistencies and repairs        74.7    65.1    38.5   62.8   23.2   63.4   17.2   72.2   72.4    51.3
    Repair corpus                               40.0    34.8    20.5   33.5   10.3   28.1   5.1    21.5   58.5    41.5
  Idle                                          0.0     0.0     2.3    3.7    3.1    8.5    1.5    6.3    10.2    7.2
    Waiting for peers                           0.0     0.0     2.2    3.7    3.1    8.4    1.5    6.2    10.1    7.2
    Waiting for master                          –       –       –      –      –      –      –      –      0.1     0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage – and the associated load-balancing issues – would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of the task when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time it is load-balancing for the inconsistency detection and repair selection that sees overall performance start to converge as machines are added.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual      Auto        Count   %
SAME        *           223     44.3
DIFFERENT   *           261     51.9
UNCLEAR     –           19      3.8
*           SAME        361     74.6
*           DIFFERENT   123     25.4
SAME        SAME        205     91.9
SAME        DIFFERENT   18      8.1
DIFFERENT   DIFFERENT   105     40.2
DIFFERENT   SAME        156     59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

    Functional property                   Detections
1   foaf:gender                           60
2   foaf:age                              32
3   dbo:height, dbo:length                4
4   dbo:wheelbase, dbo:width              3
5   atomowl:body                          1
6   loc:address, loc:name, loc:phone      1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

    Disjoint class 1       Disjoint class 2     Detections
1   foaf:Document          foaf:Person          163
2   foaf:Document          foaf:Agent           153
3   foaf:Organization      foaf:Person          17
4   foaf:Person            foaf:Project         3
5   foaf:Organization      foaf:Project         1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.

44 We were particularly careful to distinguish information resources – such as WIKIPEDIA articles – and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%) – containing 3,985 identifiers – causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 223 times (of which 49 were TRIVIALLY SAME) and option DIFFERENT 261 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded; the results are enumerated in Table 20. Of the 484 (clear) manually-inspected pairs, 361 (74.6%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.

Interpreting and inspecting the results, we found that the


inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but which were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log n); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks – particularly the scale and diversity of the datasets – we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, there are often no objective criteria for determining what precisely signifies what – the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works – more related to Information Retrieval and Natural Language Processing – focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, building compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets – i.e., consistently named datasets –


across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology – particularly functional properties, inverse-functional properties and disjointness constraints – which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string-similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparisons, e.g., using string-matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider the use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL – we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges – esp. scalability and noise – of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system – called ObjectCoRef – for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely


to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties – i.e., those for which a typical value is shared by few instances – are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage the use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>,54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes – e.g., to notify another agent or correct broken links – and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs – particularly substitution – does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

56 Again, our inspection results are available at http://swse.deri.org/entity

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was five thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We


have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper – particularly as applied to large-scale Linked Data corpora – is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been

funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

"T-Box prefixes"
Prefix        URI
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/#
b2r           http://bio2rdf.org/bio2rdf#
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
Prefix        URI
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://www.data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/.
[4] Tim Berners-Lee, Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in:


Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).
[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/.
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/.

[25] Laura M Haas Martin Hentschel Donald Kossmann Reneacutee J Miller Schemaand data a holistic approach to mapping resolution and fusion in informationintegration ER (2009) 27ndash40

[26] Harry Halpin Patrick J Hayes James P McCusker Deborah L McGuinnessHenry S Thompson When owlsameAs Isnrsquot the Same An Analysis of Identityin Linked Data in International Semantic Web Conference (1) (2010) 305ndash320

[27] Harry Halpin Ivan Herman Pat Hayes When owlsameAs isnrsquot the Same AnAnalysis of Identity Links on the Semantic Web in Linked Data on the WebWWW2010 Workshop (LDOW2010) (2010)

[28] Patrick Hayes RDF Semantics W3C Recommendation (2004) httpwwww3orgTRrdf-mt

[29] Aidan Hogan Exploiting RDFS and OWL for Integrating Heterogeneous Large-Scale Linked Data Corpora PhD Thesis Digital Enterprise Research InstituteNational University of Ireland Galway 2011 Available from lthttpaidanhogancomdocsthesisgt

[30] Aidan Hogan Andreas Harth Stefan Decker Performing Object Consolidationon the Semantic Web Data Graph in First I3 Workshop Identity IdentifiersIdentification Workshop 2007

[31] Aidan Hogan Andreas Harth Axel Polleres Scalable Authoritative OWLReasoning for the Web Int J Semantic Web Inf Syst 5 (2) (2009) 49ndash90

[32] Aidan Hogan Andreas Harth Juumlrgen Umbrich Sheila Kinsella Axel PolleresStefan Decker Searching and browsing linked data with SWSE the SemanticWeb search engine J Web Sem 9 (4) (2011) 365ndash401

[33] Aidan Hogan Jeff Z Pan Axel Polleres Stefan Decker SAOR template ruleoptimisations for distributed reasoning over 1 billion linked data triples inInternational Semantic Web Conference 2010

[34] Aidan Hogan Axel Polleres Juumlrgen Umbrich Antoine Zimmermann Someentities are more equal than others statistical methods to consolidate LinkedData in Fourth International Workshop on New Forms of Reasoning for theSemantic Web Scalable and Dynamic (NeFoRS2010) 2010

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT, 2007, pp. 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW, 2010, pp. 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 2008. <http://www.w3.org/TR/rdf-sparql-query/>.

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), 2009.

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop, 2008.

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC, 2010.

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference, 2011.

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR, 2007, pp. 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009), 2009.

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference, 2009, pp. 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


2.6. OWL 2 RL/RDF rules

Inference rules can be used to (partially) support the semantics of RDFS and OWL. Along these lines, the OWL 2 RL/RDF [24] ruleset is a partial axiomatisation of the OWL 2 RDF-Based Semantics, which is applicable to arbitrary RDF graphs and constitutes an extension of the RDF Semantics [28]; in other words, the ruleset supports a standardised profile of reasoning that partially covers the complete OWL semantics. Interestingly for our scenario, this profile includes inference rules that support the semantics and inference of owl:sameAs, as we now discuss.

First, in Table 1, we provide the set of OWL 2 RL/RDF rules that support the (positive) semantics of owl:sameAs, axiomatising the symmetry (rule eq-sym) and transitivity (rule eq-trans) of the relation. To take an example, rule eq-sym is as follows:

  (?x, owl:sameAs, ?y) ⇒ (?y, owl:sameAs, ?x)

where, if we know that (a, owl:sameAs, b), the rule will infer the symmetric relation (b, owl:sameAs, a). Rule eq-trans operates analogously; both together allow for computing the transitive, symmetric closure of the equivalence relation. Furthermore, Table 1 also contains the OWL 2 RL/RDF rules that support the semantics of replacement (rules eq-rep-*), whereby data that holds for one entity must also hold for equivalent entities.

Note that we (optionally, and in the case of the later evaluation) choose not to support:

(i) the reflexive semantics of owl:sameAs, since reflexive owl:sameAs statements will not lead to any consolidation or other non-reflexive equality relations, and will produce a large bulk of materialisations;

(ii) equality on predicates or on values for rdf:type, where we do not want possibly imprecise owl:sameAs data to affect terms in these positions.

Given that the semantics of equality is quadratic with respect to the A-Box, we apply a partial-materialisation approach that gives our notion of consolidation: instead of materialising (i.e., explicitly writing down) all possible inferences given by the semantics of replacement, we instead choose one canonical identifier to represent the set of equivalent terms, thus effectively compressing the data. We have used this approach in previous works [30–32], and it has also appeared in related works in the literature [70,6] as a common-sense optimisation for handling data-level equality. To take an example, in the later evaluation (cf. Table 10) we find a valid equivalence class (set of equivalent entities) with 33,052 members; materialising all pairwise non-reflexive owl:sameAs statements would infer more than 1 billion owl:sameAs relations (33,052² − 33,052 = 1,092,401,652); further assuming that each entity appeared in, on average, e.g., two quadruples, we would infer an additional ~2 billion statements of massively duplicated data. By choosing a single canonical identifier, we would instead only materialise in the order of 100 thousand statements.
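To restate the above arithmetic in general terms (this is only a reformulation of the figures already given, for an equivalence class of n = 33,052 members, each mentioned on average in two quadruples):

  \[ n(n-1) \;=\; 33{,}052 \times 33{,}051 \;=\; 1{,}092{,}401{,}652 \quad \text{pairwise non-reflexive owl:sameAs statements,} \]
  \[ 2\,n(n-1) \;\approx\; 2.2 \times 10^{9} \quad \text{duplicated quadruples via replacement, versus} \]
  \[ 2n + (n-1) \;\approx\; 10^{5} \quad \text{statements under canonicalisation (rewritten quadruples plus links to the canonical term).} \]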

Note that although we only perform partial materialisation, we do not change the semantics of equality: our methods are sound with respect to OWL semantics. In addition, alongside the partially materialised data, we provide a set of consolidated owl:sameAs relations (containing all of the identifiers in each equivalence class) that can be used to "backward-chain" the full inferences possible through replacement (as required). Thus, we do not consider the canonical identifier as somehow 'definitive' or superseding the other identifiers, but merely consider it as representing the equivalence class.7

Table 1
Rules that support the positive semantics of owl:sameAs.

  OWL2RL     Antecedent (assertional)                        Consequent
  eq-sym     (?x owl:sameAs ?y)                              (?y owl:sameAs ?x)
  eq-trans   (?x owl:sameAs ?y), (?y owl:sameAs ?z)          (?x owl:sameAs ?z)
  eq-rep-s   (?s owl:sameAs ?s'), (?s ?p ?o)                 (?s' ?p ?o)
  eq-rep-o   (?o owl:sameAs ?o'), (?s ?p ?o)                 (?s ?p ?o')

Finally, herein we do not consider consolidation of literals: one may consider useful applications, e.g., for canonicalising datatype literals, but such discussion is out of the current scope.

As previously mentioned, OWL 2 RL/RDF also contains rules that use terminological knowledge (alongside assertional knowledge) to directly infer owl:sameAs relations. We enumerate these rules in Table 2; note that we italicise the labels of rules supporting features new to OWL 2, and that we list authoritative variables in bold.8

Applying only these OWL 2 RL/RDF rules may miss the inference of some owl:sameAs statements. For example, consider the example RDF data given in Turtle syntax [3] as follows:

  # From the FOAF Vocabulary:
  foaf:homepage rdfs:subPropertyOf foaf:isPrimaryTopicOf .
  foaf:isPrimaryTopicOf owl:inverseOf foaf:primaryTopic .
  foaf:isPrimaryTopicOf a owl:InverseFunctionalProperty .

  # From Example Document A:
  exA:axel foaf:homepage <http://polleres.net/> .

  # From Example Document B:
  <http://polleres.net/> foaf:primaryTopic exB:apolleres .

  # Inferred through prp-spo1:
  exA:axel foaf:isPrimaryTopicOf <http://polleres.net/> .

  # Inferred through prp-inv*:
  exB:apolleres foaf:isPrimaryTopicOf <http://polleres.net/> .

  # Subsequently inferred through prp-ifp:
  exA:axel owl:sameAs exB:apolleres .
  exB:apolleres owl:sameAs exA:axel .

The example uses properties from the prominent FOAF vocabulary, which is used for publishing personal profiles as RDF. Here, we additionally need the OWL 2 RL/RDF rules prp-inv* and prp-spo1 (handling standard owl:inverseOf and rdfs:subPropertyOf inferencing, respectively) to infer the owl:sameAs relation entailed by the data.

Along these lines, we support an extended set of OWL 2 RL/RDF rules that contain precisely one assertional pattern, and for which we have demonstrated a scalable implementation, called SAOR, designed for Linked Data reasoning [33]. These rules are listed in Table 3 and, as per the previous example, can generate additional inferences that indirectly lead to the derivation of new owl:sameAs data. We will use these rules for entity consolidation in Section 5.

Lastly, as we discuss later (particularly in Section 7), Linked Data is inherently noisy and prone to mistakes.

7. We may optionally consider non-canonical blank-node identifiers as redundant and discard them.

8. As discussed later, our current implementation requires at least one variable to appear in all assertional patterns in the body, and so we do not support prp-key and cls-maxqc3; however, these two features are not commonly used in Linked Data vocabularies [29].


Table 2
OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs.

  prp-fp      Terminological: ?p a owl:FunctionalProperty
              Assertional:    ?x ?p ?y1 , ?y2
              Consequent:     ?y1 owl:sameAs ?y2

  prp-ifp     Terminological: ?p a owl:InverseFunctionalProperty
              Assertional:    ?x1 ?p ?y . ?x2 ?p ?y
              Consequent:     ?x1 owl:sameAs ?x2

  cls-maxc2   Terminological: ?x owl:maxCardinality 1 ; owl:onProperty ?p
              Assertional:    ?u a ?x ; ?p ?y1 , ?y2
              Consequent:     ?y1 owl:sameAs ?y2

  cls-maxqc4  Terminological: ?x owl:maxQualifiedCardinality 1 ; owl:onProperty ?p ; owl:onClass owl:Thing
              Assertional:    ?u a ?x ; ?p ?y1 , ?y2
              Consequent:     ?y1 owl:sameAs ?y2

Table 3
OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]).

  OWL2RL      Terminological                                        Assertional     Consequent
  prp-dom     ?p rdfs:domain ?c                                     ?x ?p ?y        ?x a ?c
  prp-rng     ?p rdfs:range ?c                                      ?x ?p ?y        ?y a ?c
  prp-symp    ?p a owl:SymmetricProperty                            ?x ?p ?y        ?y ?p ?x
  prp-spo1    ?p1 rdfs:subPropertyOf ?p2                            ?x ?p1 ?y       ?x ?p2 ?y
  prp-eqp1    ?p1 owl:equivalentProperty ?p2                        ?x ?p1 ?y       ?x ?p2 ?y
  prp-eqp2    ?p1 owl:equivalentProperty ?p2                        ?x ?p2 ?y       ?x ?p1 ?y
  prp-inv1    ?p1 owl:inverseOf ?p2                                 ?x ?p1 ?y       ?y ?p2 ?x
  prp-inv2    ?p1 owl:inverseOf ?p2                                 ?x ?p2 ?y       ?y ?p1 ?x
  cls-int2    ?c owl:intersectionOf (?c1 ... ?cn)                   ?x a ?c         ?x a ?c1 ... ?cn
  cls-uni     ?c owl:unionOf (?c1 ... ?ci ... ?cn)                  ?x a ?ci        ?x a ?c
  cls-svf2    ?x owl:someValuesFrom owl:Thing ; owl:onProperty ?p   ?u ?p ?v        ?u a ?x
  cls-hv1     ?x owl:hasValue ?y ; owl:onProperty ?p                ?u a ?x         ?u ?p ?y
  cls-hv2     ?x owl:hasValue ?y ; owl:onProperty ?p                ?u ?p ?y        ?u a ?x
  cax-sco     ?c1 rdfs:subClassOf ?c2                               ?x a ?c1        ?x a ?c2
  cax-eqc1    ?c1 owl:equivalentClass ?c2                           ?x a ?c1        ?x a ?c2
  cax-eqc2    ?c1 owl:equivalentClass ?c2                           ?x a ?c2        ?x a ?c1

Table 4
OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold.

  OWL2RL       Terminological                                                       Assertional
  eq-diff1     –                                                                    ?x owl:sameAs ?y . ?x owl:differentFrom ?y
  prp-irp      ?p a owl:IrreflexiveProperty                                         ?x ?p ?x
  prp-asyp     ?p a owl:AsymmetricProperty                                          ?x ?p ?y . ?y ?p ?x
  prp-pdw      ?p1 owl:propertyDisjointWith ?p2                                     ?x ?p1 ?y ; ?p2 ?y
  prp-adp      ?x a owl:AllDisjointProperties ; owl:members (?p1 ... ?pn)           ?u ?pi ?y ; ?pj ?y (i ≠ j)
  cls-com      ?c1 owl:complementOf ?c2                                             ?x a ?c1 , ?c2
  cls-maxc1    ?x owl:maxCardinality 0 ; owl:onProperty ?p                          ?u a ?x ; ?p ?y
  cls-maxqc2   ?x owl:maxQualifiedCardinality 0 ; owl:onProperty ?p ; owl:onClass owl:Thing    ?u a ?x ; ?p ?y
  cax-dw       ?c1 owl:disjointWith ?c2                                             ?x a ?c1 , ?c2
  cax-adc      ?x a owl:AllDisjointClasses ; owl:members (?c1 ... ?cn)              ?z a ?ci , ?cj (i ≠ j)


Although our methods are sound (i.e., correct) with respect to formal OWL semantics, such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair. Relatedly, OWL contains features that allow for detecting formal contradictions (called inconsistencies) in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization: formally, the intersection of such disjoint classes should be empty, and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later, in Section 7, to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations:

– Scatter: partition on-disk data using some local split function, and send each chunk to an individual slave machine for subsequent processing.

– Run: request the parallel execution of a task by the slave machines; such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis.

– Gather: gather chunks of output data from the slave swarm and perform some local merge function over the data.

– Flood: broadcast global knowledge required by all slave machines for a future task.


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine are partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel, using some merge function.

The above operation allows the slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid channelling all intermediary data through one machine.
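For illustration only, the following is a minimal Java sketch of how the scatter/run/gather/flood and coordinate operations might be exposed as a remote interface; the interface and method names here are ours and are not taken from the actual code-base (which, as described in Section 2.8, is built on Java RMI):

  import java.io.File;
  import java.rmi.Remote;
  import java.rmi.RemoteException;
  import java.util.List;

  // Hypothetical sketch of the operations a slave machine exposes to the master.
  interface SlaveNode extends Remote {
    // scatter/flood: receive a chunk of data (a partition, or globally required knowledge)
    void receive(File chunk) throws RemoteException;

    // run: execute a named task over the local data, in parallel with the other slaves
    void run(String taskName) throws RemoteException;

    // gather: return local output chunks so that the master can merge them
    List<File> gatherOutput() throws RemoteException;

    // coordinate: split local data with the given split function and exchange the
    // resulting chunks directly with peer slaves, bypassing the master machine
    void coordinate(String splitFunction, List<SlaveNode> peers) throws RemoteException;
  }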

Note that herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications; that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle waiting for the slaves and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel; although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which the task should be performed, and the task-scheduling possible. The distributed implementation of our tasks is designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid (insofar as is reasonable) duplicating computation.

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using the Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

9. http://openhms.sourceforge.net/rmiio/

10. We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.

11. Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications, viz.: 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second; URIs with the highest inlink count per each PLD queue were polled first.
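As an aside, the per-PLD polling policy just described can be sketched as follows; the class and field names are illustrative only (the actual crawler is described in [32,29]):

  import java.util.*;

  // Sketch: one queue per pay-level-domain (PLD), ordered by inlink count, with a
  // politeness policy of at most two URI lookups per PLD per second (>= 500 ms gap).
  public class PoliteFrontier {
    static class ScoredUri {
      final String uri; final int inlinks;
      ScoredUri(String uri, int inlinks) { this.uri = uri; this.inlinks = inlinks; }
    }

    private final Map<String, PriorityQueue<ScoredUri>> queues = new HashMap<>();
    private final Map<String, Long> lastPoll = new HashMap<>();
    private static final long MIN_GAP_MS = 500;

    public void enqueue(String pld, String uri, int inlinks) {
      queues.computeIfAbsent(pld, k ->
          new PriorityQueue<>((a, b) -> Integer.compare(b.inlinks, a.inlinks)))
        .add(new ScoredUri(uri, inlinks));
    }

    // Returns the next URI to fetch, or null if all PLDs are inside their politeness window.
    public String poll() {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, PriorityQueue<ScoredUri>> e : queues.entrySet()) {
        if (!e.getValue().isEmpty()
            && now - lastPoll.getOrDefault(e.getKey(), 0L) >= MIN_GAP_MS) {
          lastPoll.put(e.getKey(), now);
          return e.getValue().poll().uri;
        }
      }
      return null;
    }
  }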

With regards to the resulting corpus: of the 1.118 billion quads, 1.106 billion are unique and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now we discuss the usage of terms in a data-level position, viz. terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs, and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred on average 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8 times the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position: 19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs: 56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident; again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8 times more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8 times more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

– each URI is re-used on average in 4.7 documents, but usually only within the same domain: most external re-use is in the object position of a triple;

– 99% of URIs appear in only one PLD.

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position); log/log plot of the number of URIs (y-axis) against the number of documents mentioning the URI (x-axis).

Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position); log/log plot of the number of URIs (y-axis) against the number of PLDs mentioning the URI (x-axis).

We can conclude that within our corpus (itself a general crawl for RDF/XML on the Web) there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource" [7]

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus, provided directly by the Linked Data publishers themselves, that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers, where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps each identifier to the equivalence class in which it is contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier for each equivalence class; we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;

(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;

(iii) for each equivalence class in the index, choose a canonical term;

(iv) scan the corpus again, canonicalising any term in the subject position or the object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus (the owl:sameAs statements) and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;

(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);

(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing the term with the lowest alphabetical ordering; a canonical term is thus associated with each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
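A minimal Java sketch of this canonical-term selection, assuming blank nodes are serialised with a "_:" prefix (as in N-Triples); the class and method names are ours:

  import java.util.*;

  // Prefer URIs over blank nodes; break ties by lowest lexical (alphabetical) order.
  public class CanonicalTerm {
    static boolean isBlankNode(String term) {
      return term.startsWith("_:");        // assumption: N-Triples-style blank-node labels
    }

    static String choose(Collection<String> equivalenceClass) {
      return Collections.min(equivalenceClass, (a, b) -> {
        boolean ba = isBlankNode(a), bb = isBlankNode(b);
        if (ba != bb) return ba ? 1 : -1;  // any URI sorts before any blank node
        return a.compareTo(b);             // otherwise the lowest ordering wins
      });
    }
  }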

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations.

(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index.

(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods (involving the two scans of the main corpus) can be conducted in parallel.

Algorithm 1. Building equivalence map [32].

Require: owl:sameAs data SA
 1: map ← {}
 2: for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
 3:   eqc_s ← map.get(s)
 4:   eqc_o ← map.get(o)
 5:   if eqc_s = ∅ ∧ eqc_o = ∅ then
 6:     eqc_s∪o ← {s, o}
 7:     map.put(s, eqc_s∪o)
 8:     map.put(o, eqc_s∪o)
 9:   else if eqc_s = ∅ then
10:     add s to eqc_o
11:     map.put(s, eqc_o)
12:   else if eqc_o = ∅ then
13:     add o to eqc_s
14:     map.put(o, eqc_s)
15:   else if eqc_s ≠ eqc_o then
16:     if |eqc_s| > |eqc_o| then
17:       add all eqc_o into eqc_s
18:       for e_o ∈ eqc_o do
19:         map.put(e_o, eqc_s)
20:       end for
21:     else
22:       add all eqc_s into eqc_o
23:       for e_s ∈ eqc_s do
24:         map.put(e_s, eqc_o)
25:       end for
26:     end if
27:   end if
28: end for
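For concreteness, the following is a minimal in-memory Java rendering of Algorithm 1; it is a sketch only, and the class and method names are ours rather than those of the SWSE code-base:

  import java.util.*;

  // Minimal sketch of the map-based union-find used to build equivalence classes.
  public class EquivalenceMap {
    // Maps each term to the (shared) set of terms it is equivalent to.
    private final Map<String, Set<String>> map = new HashMap<>();

    // Register an owl:sameAs pair (s, o); reflexive pairs are ignored.
    public void addSameAs(String s, String o) {
      if (s.equals(o)) return;                      // skip reflexive statements
      Set<String> eqS = map.get(s);
      Set<String> eqO = map.get(o);
      if (eqS == null && eqO == null) {             // neither term seen before
        Set<String> eq = new HashSet<>(Arrays.asList(s, o));
        map.put(s, eq);
        map.put(o, eq);
      } else if (eqS == null) {                     // only o seen before
        eqO.add(s);
        map.put(s, eqO);
      } else if (eqO == null) {                     // only s seen before
        eqS.add(o);
        map.put(o, eqS);
      } else if (eqS != eqO) {                      // merge the smaller class into the larger
        Set<String> big = eqS.size() >= eqO.size() ? eqS : eqO;
        Set<String> small = (big == eqS) ? eqO : eqS;
        big.addAll(small);
        for (String t : small) map.put(t, big);
      }
    }

    // Lookup (find) the equivalence class of a term, or null if it has none.
    public Set<String> find(String term) {
      return map.get(term);
    }
  }

As in Algorithm 1, merging the smaller set into the larger keeps the cost of the union step low.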

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%); i.e., on average the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for on average 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1, 2, 4, 8) over the smaller corpus of 100 million quadruples.

Table 5
Breakdown of timing of distributed baseline consolidation; each cell gives minutes and the percentage of total execution time.

  Category                       1 slave         2 slaves        4 slaves       8 slaves       Full-8
  Master
    Total execution time         42.2 (100%)     22.3 (100%)     11.9 (100%)    6.7 (100%)     63.3 (100%)
    Executing                    0.5 (1.1%)      0.5 (2.2%)      0.5 (4.0%)     0.5 (7.5%)     8.5 (13.4%)
      Aggregating owl:sameAs     0.4 (1.1%)      0.5 (2.2%)      0.5 (4.0%)     0.5 (7.5%)     8.4 (13.3%)
      Miscellaneous              0.1 (0.1%)      0.1 (0.4%)      * (*)          * (*)          0.1 (0.1%)
    Idle (waiting for slaves)    41.8 (98.9%)    21.8 (97.7%)    11.4 (95.8%)   6.2 (92.1%)    54.8 (86.6%)
  Slaves
    Avg. executing (total)       41.8 (98.9%)    21.8 (97.7%)    11.4 (95.8%)   6.2 (92.1%)    53.5 (84.6%)
      Extract owl:sameAs         9.0 (21.4%)     4.5 (20.2%)     2.2 (18.4%)    1.1 (16.8%)    12.3 (19.5%)
      Consolidate                32.7 (77.5%)    17.1 (76.9%)    8.8 (73.5%)    4.7 (70.5%)    41.2 (65.1%)
    Avg. idle                    0.5 (1.1%)      0.7 (2.9%)      1.0 (8.2%)     0.8 (12.7%)    9.8 (15.4%)
      Waiting for peers          0.0 (0.0%)      0.1 (0.7%)      0.5 (4.0%)     0.3 (4.7%)     1.3 (2.0%)
      Waiting for master         0.5 (1.1%)      0.5 (2.3%)      0.5 (4.2%)     0.5 (7.9%)     8.5 (13.4%)

Fig. 3. Distribution of sizes of equivalence classes, on log/log scale (from [32]); the number of classes (y-axis) is plotted against equivalence class size (x-axis).

Fig. 4. Distribution of the number of PLDs per equivalence class, on log/log scale; the number of classes (y-axis) is plotted against the number of PLDs in the equivalence class (x-axis).

12. For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-2877d02973d0d01e8f29c7113776e7e which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


Note that in the table, we use the symbol '*' to indicate a negligible value (0 < * < 0.05). We see that as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528, 0.536 and 0.561 reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5 would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs): an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities, and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not they were verified as correct/incorrect. Indeed, the results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp. Dieter Fensel and Rudi Studer: authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); and (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results.


Table 6
Largest five equivalence classes (from [32]).

  #  Canonical term (lexically first in equivalence class)                              Size   Correct
  1  http://bio2rdf.org/dailymed_drugs:1000                                             8,481  –
  2  http://bio2rdf.org/dailymed_drugs:1042                                             800    –
  3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e        443    yes
  4  http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1              353    yes
  5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9   316    yes


For the sampling, we take the closed equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision, or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of the total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

  #   PLD             PLD                       Co-occur
  1   rdfize.com      uriburner.com             506,623
  2   dbpedia.org     freebase.com              187,054
  3   bio2rdf.org     purl.org                  185,392
  4   loc.gov         info:lc/authorities (a)   166,650
  5   l3s.de          rkbexplorer.com           125,842
  6   bibsonomy.org   l3s.de                    125,832
  7   bibsonomy.org   rkbexplorer.com           125,830
  8   dbpedia.org     mpii.de                   99,827
  9   freebase.com    mpii.de                   89,610
  10  dbpedia.org     umbel.org                 65,565

  (a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities, in particular using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features (including functional properties, inverse-functional properties and certain cardinality restrictions) that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes (such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc.) to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, Wikipedia articles, etc.

Although the Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to the methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive, symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise on-disk methods for rewriting the data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked Data corpora.

13. cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.

14. Further note that, since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom15 (involving properties with less than five occurrences in the data); hence, we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins and are thus amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;

(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;

(iii) apply the general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;

(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

15. http://huemer.lstadler.net/role/rh.rdf

16. Note that we have presented non-distributed, batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].

17. Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

In Step (i), we extract the terminological data (required for the application of our rules) from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern and are thus amenable to execution by a single scan; e.g., given the statement

  homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate the T-ground rule

  (?x, homepage, ?y) ⇒ (?x, isPrimaryTopicOf, ?y)

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be applied by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

  (?x, isPrimaryTopicOf, ?y)

which may indicate data useful for consolidation (note that here, isPrimaryTopicOf is assumed to be inverse-functional).

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern (including any owl:sameAs statements found) are buffered to a file for later join computation.

Subsequently, in Step (iv), we must now compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18

(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches are written to disk, and the sorted batches are then merged to produce the final output; and

(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position. Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.
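A rough sketch of such a merge-join for the inverse-functional-property case follows (our own illustration, assuming the membership file has already been sorted so that records sharing a property and value are adjacent; the grouping key and names are assumptions):

# Sketch: derive owl:sameAs pairs from inverse-functional-property
# memberships, assuming the triples are sorted by (property, value).
from itertools import groupby

def sameas_from_ifp(sorted_triples):
    """sorted_triples: iterable of (s, p, o) sorted by (p, o);
    yields (s1, 'owl:sameAs', s2) for subjects sharing an IFP value."""
    for (p, o), group in groupby(sorted_triples, key=lambda t: (t[1], t[2])):
        subjects = sorted({s for s, _, _ in group})
        canonical = subjects[0]                 # lexically lowest as pivot
        for s in subjects[1:]:
            yield (canonical, "owl:sameAs", s)

triples = sorted([
    ("dblp:Axel", "foaf:isPrimaryTopicOf", "http://polleres.net/"),
    ("ex:axelme", "foaf:isPrimaryTopicOf", "http://polleres.net/"),
], key=lambda t: (t[1], t[2]))
print(list(sameas_from_ifp(triples)))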

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method again, leveraging external sorts and merge-joins to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b and b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;
(ii) transitivity is supported by iterative merge-join scans:
- in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
- at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;
- the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
- inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;
(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;
(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences; when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation. This is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
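Algorithm 1 itself is not reproduced here; as a rough in-memory sketch of the batching idea (our own illustration, assuming identifiers are plain strings and that the lexically lowest term of each class acts as the canonical pivot), a union-find structure can collapse an equality chain in a single pass before the batch is flushed to disk:

# Rough sketch: collapse owl:sameAs pairs into an in-memory equivalence batch,
# electing the lexically lowest member of each class as its canonical pivot.
class EqualityBatch:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:                 # path-halving find
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        lo, hi = sorted((ra, rb))                  # keep lexically lowest as root
        self.parent[hi] = lo

    def canonical_pairs(self):
        """Yield (canonical, member) pairs, i.e. canonicalised SA relations."""
        for x in list(self.parent):
            root = self.find(x)
            if root != x:
                yield (root, x)

batch = EqualityBatch()
for a, b in [("ex:b", "ex:c"), ("ex:a", "ex:b")]:   # a chain a SA b SA c
    batch.union(a, b)
print(sorted(batch.canonical_pairs()))              # [('ex:a', 'ex:b'), ('ex:a', 'ex:c')]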

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position. (Footnote 20: one could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.)
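The two-pass rewrite can be sketched as follows (an illustrative fragment under the assumption that the canonical mapping fits in a Python dict; the on-disk version would instead merge-join against the sorted owl:sameAs index):

# Illustrative two-pass rewrite: rewrite subjects, re-sort by object, then
# rewrite objects (skipping rdf:type values), using a non-canonical -> canonical map.
def rewrite(quads, canon):
    # pass 1: rewrite subjects (the on-disk data would be sorted by subject)
    quads = [(canon.get(s, s), p, o, c) for s, p, o, c in quads]
    # re-sort by object, then pass 2: rewrite objects of non-rdf:type quads
    quads.sort(key=lambda q: q[2])
    return [(s, p, o if p == "rdf:type" else canon.get(o, o), c)
            for s, p, o, c in quads]

canon = {"http://polleres.net/": "http://axel.deri.ie/"}   # non-canonical -> canonical
quads = [("dblp:Axel", "foaf:isPrimaryTopicOf", "http://polleres.net/", "ctx:doc1")]
print(rewrite(quads, canon))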

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cax-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
<http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs, we require a higher-level iterative process as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;
(ii) apply the initial closure of the owl:sameAs data;
(iii) then, iteratively until no new owl:sameAs inferences are found:
- rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
- derive new owl:sameAs inferences possible through the previous rewriting, for each consolidation rule;
- re-derive the closure of the owl:sameAs data, including the new inferences.

The final, closed file of owl:sameAs data can then be re-used to rewrite the main corpus, in two sorts and merge-join scans over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;
(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;
(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;
(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;
(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:

- Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;
- Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);
(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;
(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

Table 8
Breakdown of timing of distributed extended consolidation with reasoning (the closure of owl:sameAs on the master machine and the initial sort on the slave machines run concurrently).

Category                                      1             2             4             8             Full-8
                                              Min    %      Min    %      Min    %      Min    %      Min    %
Master
  Total execution time                        454.3  100.0  230.2  100.0  127.5  100.0  81.3   100.0  740.4  100.0
  Executing                                   29.9   6.6    30.3   13.1   32.5   25.5   30.1   37.0   243.6  32.9
    Aggregate consolidation-relevant data     4.3    0.9    4.5    2.0    4.7    3.7    4.6    5.7    29.9   4.0
    Closing owl:sameAs                        24.4   5.4    24.5   10.6   25.4   19.9   24.4   30.0   211.2  28.5
    Miscellaneous                             1.2    0.3    1.3    0.5    2.4    1.9    1.1    1.3    2.5    0.3
  Idle (waiting for slaves)                   424.5  93.4   199.9  86.9   95.0   74.5   51.2   63.0   496.8  67.1
Slave
  Avg. executing (total)                      448.9  98.8   221.2  96.1   111.6  87.5   56.5   69.4   599.1  80.9
    Extract terminology                       44.4   9.8    22.7   9.9    11.8   9.2    6.9    8.4    49.4   6.7
    Extract consolidation-relevant data       95.5   21.0   48.3   21.0   24.3   19.1   13.2   16.2   137.9  18.6
    Initial sort (by subject)                 98.0   21.6   48.3   21.0   23.7   18.6   11.6   14.2   133.8  18.1
    Consolidation                             211.0  46.4   101.9  44.3   51.9   40.7   24.9   30.6   278.0  37.5
  Avg. idle                                   5.5    1.2    8.9    3.9    37.9   29.7   23.6   29.0   141.3  19.1
    Waiting for peers                         0.0    0.0    3.2    1.4    7.1    5.6    6.3    7.7    37.5   5.1
    Waiting for master                        5.5    1.2    5.8    2.5    30.8   24.1   17.3   21.2   103.8  14.0

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible: the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time; of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements (this time directly extracted from explicit statements, as opposed to explicit and inferred statements), which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach. (Footnote 21: at least in terms of pure quantity; however, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.)


Table 9
Top ten most frequently occurring blacklisted values.

    Blacklisted term                                  Occurrences
 1  Empty literals                                    584735
 2  <http://null>                                     414088
 3  <http://www.vox.com/gone>                         150402
 4  "08445a31a78661b5c746feff39a9db6e4e2cc5cf"        58338
 5  <http://www.facebook.com>                         6988
 6  <http://facebook.com>                             5462
 7  <http://www.google.com>                           2234
 8  <http://www.facebook.com>                         1254
 9  <http://google.com>                               1108
10  <http://null.com>                                 542

[Fig. 5: Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log scale); x-axis: equivalence class size, y-axis: number of classes.]

Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of one. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc. However, these particular axioms would only affect consolidation for literals; thus, we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over the general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.

During the computation of the owl:sameAs closure, we found zero inferences through the cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

(Footnote 22: cf. http://bio2rdf.org/ns/bio2rdf:Topic, offline as of 2011/06/20. Footnote 23: our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt. Footnote 24: this site shut down on 2010/09/30.)

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 217.3% from baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 101.4% from baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user, labelled Team Vox, commonly appearing in user FOAF exports on the Vox blogging platform. Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford, labelled IBRG, where again it is identified in thousands of documents using different blank-nodes but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time we selected: SAME, 823 (82.3%); TRIVIALLY SAME, 145 (14.5%); DIFFERENT, 23 (2.3%); and UNCLEAR, 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

    Canonical term (lexically lowest in equivalence class)               Size     Correct?
 1  _:bnode37 (http://a12iggymom.vox.com/profile/foaf.rdf)               33,052   ✓
 2  http://bio2rdf.org/dailymed_drugs:1000                               8,481    ✗
 3  http://ajft.org/rdf/foaf.rdf#_me                                      8,140    ✓
 4  _:bnode4 (http://174.129.121.40:8080/tcmdata/association100)         4,395    ✓
 5  _:bnode1 (http://aaa977.vox.com/profile/foaf.rdf)                    1,977    ✓

[Fig. 6: Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log scale); x-axis: number of PLDs in equivalence class, y-axis: number of classes.]

Table 11
Number of equivalent identifiers found for the top five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

    Canonical identifier                            BL     R
 1  <http://dblp.l3s.de/Tim_Berners-Lee>             26    50
 2  <genid:danbri>                                   10   138
 3  <http://update.status.net/>                       0     0
 4  <http://www.ldodds.com/foaf/foaf-a-matic>         0     0
 5  <http://update.status.net/user/1#acct>            0     6

[Fig. 7: Distribution of the number of PLDs that terms are referenced by, for the raw, baseline-consolidated and reasoning-consolidated data (log/log scale); x-axis: number of PLDs mentioning a blank-node/URI term, y-axis: number of terms.]

to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results. (Footnote 25: we make these and later results available for review at http://swse.deri.org/entity.)

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand; 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million; 56.4%). The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person. (Footnote 26: Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.)

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; and (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes (Footnote 27: we believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents); (ii) relatedly, there is only a minor expansion (101.4%) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities, those with higher concurrence scores, are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

- investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;
- formulated an initial small-scale (5.6 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results: we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or coreference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

However, deriving weighted concurrence measures has applications other than approximative consolidation; in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain agnostic, in order to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate-subject (loosely, inlinks) or predicate-object pairs (loosely, outlinks or attribute values);
(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;
(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities; i.e., a shared pair cannot reduce the measured concurrence of the sharing entities;
(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;
(v) correlation may exist between shared pairs; e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers), where we wish to dampen the cumulative effect of correlation in the concurrence analysis;
(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property-value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p, s), is the cardinality of the set {o ∈ C | (s, p, o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p, o), is the cardinality of the set {s ∈ U ∪ B | (s, p, o) ∈ G}.

Thus, loosely, the observed cardinality of a property-subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_{G_ex}(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property-subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property-object pair is the number of unique subjects it appears with in the graph; e.g., ICard_{G_ex}(foaf:gender, "female") = 2. Both directions are considered analogously for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate-subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
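To make Definitions 1 and 2 concrete, a small illustrative sketch in Python over the running example (the in-memory triple representation is an assumption of ours):

# Observed cardinality and inverse-cardinality over a toy in-memory graph.
G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Claire", "foaf:gender", '"female"'),
]

def card(graph, p, s):        # Definition 1: |{o | (s,p,o) in G}|
    return len({o for s2, p2, o in graph if (s2, p2) == (s, p)})

def icard(graph, p, o):       # Definition 2: |{s | (s,p,o) in G}|
    return len({s for s, p2, o2 in graph if (p2, o2) == (p, o)})

print(card(G_ex, "foaf:maker", "dblp:AliceB10"))    # 2
print(icard(G_ex, "foaf:gender", '"female"'))       # 2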

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate-object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

AIC_G(p) = |{(s, o) | (s, p, o) ∈ G}| / |{o | ∃s : (s, p, o) ∈ G}|

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is, by definition, greater-than or equal-to the denominator. Taking an example, AIC_{G_ex}(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus for which we only encounter a few predicate usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naively over-estimate the significance of this property-object pair as an indicator for concurrence. Thus, we use a credibility formula as follows, to weight properties with few observations towards a global mean:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

AAIC_G(p) = ( AIC_G(p) · |G_p| + ĀIC_G · |Ḡ| ) / ( |G_p| + |Ḡ| )        (1)

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), ĀIC_G is the average inverse-cardinality for all predicate-object pairs (formally, ĀIC_G = |G| / |{(p, o) | ∃s : (s, p, o) ∈ G}|), and |Ḡ| is the average number of distinct objects for all predicates in the graph (formally, |Ḡ| = |{(p, o) | ∃s : (s, p, o) ∈ G}| / |{p | ∃s ∃o : (s, p, o) ∈ G}|).

Again the adjusted average cardinality AACG(p) of a property pis defined analogously as for AAICG(p) Some reduction of Eq (1) is

possible if one considers that AICGethpTHORN jG pj frac14 jfeths oTHORNjeths p oTHORN2 Ggj also denotes the number of triples for which p appears as a

predicate in graph G and that AICG j Gj frac14 jGjjfpj9s9oethspoTHORN2Ggj also de-

notes the average number of triples per predicate We maintain Eq
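The following fragment sketches how AIC and the credibility-weighted AAIC of Definitions 3 and 4 might be computed over a small in-memory graph (an illustration only; the actual implementation derives these statistics from sorted on-disk scans, and the example graph is ours):

# Illustrative computation of AIC_G(p) and AAIC_G(p) (Definitions 3 and 4).
def aic(graph, p):
    triples_p = [(s, o) for s, p2, o in graph if p2 == p]
    objects_p = {o for _, o in triples_p}
    return len(triples_p) / len(objects_p)

def aaic(graph, p):
    po_pairs = {(p2, o) for _, p2, o in graph}
    preds = {p2 for _, p2, _ in graph}
    mean_aic = len(graph) / len(po_pairs)                    # global mean AIC
    mean_objs = len(po_pairs) / len(preds)                   # average |G_p|
    n_objs_p = len({o for _, p2, o in graph if p2 == p})     # |G_p|
    return (aic(graph, p) * n_objs_p + mean_aic * mean_objs) / (n_objs_p + mean_objs)

G_ex = [
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com>"),
]
print(aic(G_ex, "foaf:gender"))     # 1.5
print(aaic(G_ex, "foaf:gender"))    # dampened towards the global mean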

[Fig. 8: Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward edges (with respect to the principal entities) are italicised and underlined (from [34]).]

We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (ĀIC_G) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (|Ḡ|).

Now we move towards combining these metrics to determine the concurrence of entities that share a given non-empty set of property-value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property-value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company, e.g., http://google.com, should not gain as much concurrence as entities that share a value referential to a smaller company, e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive and that it is generally not a good indicator of concurrence. (Footnote 28: here we try to distinguish between property-value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate-object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.)

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence coefficient of a predicate-subject pair (p, s) with respect to a graph G is given as:

C_G(p, s) = 1 / ( Card_G(p, s) · AAC_G(p) )

and the concurrence coefficient of a predicate-object pair (p, o) with respect to a graph G is analogously given as:

IC_G(p, o) = 1 / ( ICard_G(p, o) · AAIC_G(p) )

Again, please note that these coefficients fall into the interval ]0, 1], since the denominator is, by definition, necessarily greater than one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2,000 times as a value for p_wh, i.e., ICard_G(p_wh, o_g) = 2000; now IC_G(p_wh, o_g) = 1/(2000 · 7) ≈ 0.00007. Also, let o_d = http://deri.ie such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 · 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that, for each i = 1, ..., n, z_i ∈ ]0, 1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1, ..., n, ACS_k = z_k + ACS_{k-1} − z_k · ACS_{k-1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring, P(A ∨ B) = P(A) + P(B) − P(A)·P(B), which is by definition commutative and associative, and thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
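A direct transcription of Definition 6 (an illustrative sketch; the function name and example inputs are ours):

# Aggregated concurrence score (Definition 6): probabilistic-OR style folding
# of coefficients in ]0,1]; the order of the inputs does not matter.
def acs(coefficients):
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

# e.g. aggregating the two example coefficients derived above
print(acs([0.00143, 0.00007]))   # ~0.0015, slightly below their plain sum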

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property-value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);
(ii) intra-property correlation, where two entities that share a given property-value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);
(iii) inter-property correlation, where two entities sharing a given property-value pair are likely to share further distinct but related property-value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate-value pairs: instead of aggregating a single tuple of coefficients, we generate a bag of tuples 𝒵 = {Z_{p_1}, ..., Z_{p_n}}, where each element Z_{p_i} represents the tuple of (non-pruned) coefficients generated for the predicate p_i (for brevity, we omit the graph subscript). We then aggregate this bag as follows:

ACS(𝒵) = ACS( ( ACS(Z_{p_i} · AA[I]C(p_i)) / AA[I]C(p_i) )_{Z_{p_i} ∈ 𝒵} )

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate-value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c) and then scanning the data, computing the cardinality (number of distinct objects) for each (s, p) pair and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p, o) pair and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula (Definition 4).

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s, p) pair, for each set of unique objects O_ps found thereon, and for each pair in

{ (o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_ps, o_i < o_j }

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p, s), p, s, −)

where C(p, s) = 1 / (|O_ps| · AAC(p)). We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects S_po, and for each pair in

{ (s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_po, s_i < s_j }

we output analogous sextuples of the form:

(s_i, s_j, IC(p, o), p, o, +)

We call the sets O_ps, and their analogues S_po, concurrence classes, denoting sets of entities that share the given predicate-subject or predicate-object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements; generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate-value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
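The following sketch (ours; in-memory rather than on-disk, and omitting the same-value pruning and per-predicate dampening described above) illustrates how predicate-object concurrence classes yield pairwise coefficients that are then aggregated with ACS, with a class-size cut-off to keep the quadratic blow-up in check:

# Sketch: generate raw concurrence coefficients from predicate-object classes
# and aggregate them into pairwise concurrence scores (inverse direction only).
from collections import defaultdict
from itertools import combinations

def acs(coefficients):
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

def concurrence_scores(graph, aaic, max_class_size=38):
    classes = defaultdict(set)                       # (p, o) -> S_po
    for s, p, o in graph:
        classes[(p, o)].add(s)
    pair_coeffs = defaultdict(list)
    for (p, o), subjects in classes.items():
        if not 2 <= len(subjects) <= max_class_size:
            continue                                 # drop singletons and huge classes
        ic = 1.0 / (len(subjects) * aaic[p])         # IC(p, o) with |S_po| in the denominator
        for e_i, e_j in combinations(sorted(subjects), 2):
            pair_coeffs[(e_i, e_j)].append(ic)
    return {pair: acs(coeffs) for pair, coeffs in pair_coeffs.items()}

scores = concurrence_scores(
    [("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com>"),
     ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com>"),
     ("ex:Alice",  "foaf:gender", '"female"'),
     ("ex:Claire", "foaf:gender", '"female"')],
    aaic={"foaf:workplaceHomepage": 7.0, "foaf:gender": 1000.0})
print(scores)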

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets; for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate (10^12 − 10^6)/2 ≈ 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set, e.g., |S_po|, is a factor in the denominator of the computed IC(p, o) score, such that IC(p, o) ≤ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.
(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.
(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.
(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.
(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.
(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step and produce the final non-symmetric concurrence tuples.
(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Table 12
Breakdown of timing of distributed concurrence analysis.

Category                                     1             2             4             8             Full-8
                                             Min    %      Min    %      Min    %      Min    %      Min    %
Master
  Total execution time                       516.2  100.0  262.7  100.0  137.8  100.0  73.0   100.0  835.4  100.0
  Executing                                  2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
    Miscellaneous                            2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
  Idle (waiting for slaves)                  514.0  99.6   260.4  99.1   134.6  97.7   69.6   95.4   833.3  99.7
Slave
  Avg. executing (total excl. idle)          514.0  99.6   257.8  98.1   131.1  95.1   66.4   91.0   786.6  94.2
    Split/sort/scatter (subject)             119.6  23.2   59.3   22.6   29.9   21.7   14.9   20.4   175.9  21.1
    Merge-sort (subject)                     42.1   8.2    20.7   7.9    11.2   8.1    5.7    7.9    62.4   7.5
    Split/sort/scatter (object)              115.4  22.4   56.5   21.5   28.8   20.9   14.3   19.6   164.0  19.6
    Merge-sort (object)                      37.2   7.2    18.7   7.1    9.6    7.0    5.1    7.0    55.6   6.6
    Extract high-level statistics            20.8   4.0    10.1   3.8    5.1    3.7    2.7    3.7    28.4   3.3
    Generate raw concurrence tuples          45.4   8.8    23.3   8.9    11.6   8.4    5.9    8.0    67.7   8.1
    Coordinate/sort concurrence tuples       97.6   18.9   50.1   19.1   25.0   18.1   12.5   17.2   166.0  19.9
    Merge-sort/aggregate similarity          36.0   7.0    19.2   7.3    10.0   7.2    5.3    7.2    66.6   8.0
  Avg. idle                                  2.2    0.4    4.9    1.9    6.7    4.9    6.5    9.0    48.8   5.8
    Waiting for peers                        0.0    0.0    2.6    1.0    3.6    2.6    3.2    4.3    46.7   5.6
    Waiting for master                       2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples and, given the distribution of class memberships, we assume that hashing these values will lead to an uneven distribution of data, and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.
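A minimal sketch of this split/hash coordination step (our own illustration: the record format, machine count and the rdf:type exclusion follow the description above; everything else is assumed):

# Sketch: assign each quad to a slave machine by hashing on the join position,
# skipping rdf:type quads when coordinating by object.
def partition(quads, n_slaves, position):
    """position: 0 to hash on subject, 2 to hash on object."""
    buckets = [[] for _ in range(n_slaves)]
    for quad in quads:
        if position == 2 and quad[1] == "rdf:type":
            continue                       # rdf:type objects are not coordinated
        buckets[hash(quad[position]) % n_slaves].append(quad)
    return buckets

quads = [("ex:Alice", "rdf:type", "foaf:Person", "ctx:doc1"),
         ("ex:Alice", "foaf:gender", '"female"', "ctx:doc1")]
for i, bucket in enumerate(partition(quads, n_slaves=4, position=2)):
    print(i, bucket)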

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

64 Results evaluation

With respect to data distribution after hashing on subject weobserved an average absolute deviation (average distance fromthe mean) of 176 thousand triples across the slave machines rep-resenting an average 013 deviation from the mean near-optimaldata distribution After hashing on the object of non-rdftype tri-ples we observed an average absolute deviation of 129 million tri-ples across the machines representing an average 11 deviationfrom the mean in particular we note that one machine was as-signed 37 million triples above the mean (an additional 33 abovethe mean) Although not optimal the percentage of data deviationgiven by hashing on object is still within the natural variation inrun-times we have seen for the slave machines during most paral-lel tasks

In Fig. 9a and b we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static, i.e., larger class sizes give quadratically more concurrences but occur polynomially less often, until the point where the largest classes, which generally only have one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs, respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i (s_i^2 + s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = sum over j <= i of sp_j), and our cut-off (s_i = 38), chosen to keep the total number of tuples at around one billion (log/log). [Panels: (a) predicate–object pairs; (b) predicate–subject pairs; axes: count vs. concurrence class size.]

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

  Predicate             Objects       Triples       AAC
1 foaf:nick             150,433,035   150,437,864   1.000
2 lldpubmed:journal     6,790,426     6,795,285     1.003
3 rdf:value             2,802,998     2,803,021     1.005
4 eurostat:geo          2,642,034     2,642,034     1.005
5 eurostat:time         2,641,532     2,641,532     1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  Predicate                    Subjects    Triples     AAIC
1 lldpubmed:meshHeading        2,121,384   2,121,384   1.009
2 opiumfield:recommendation    1,920,992   1,920,992   1.010
3 fb:type.object.key           1,108,187   1,108,187   1.017
4 foaf:page                    1,702,704   1,712,970   1.017
5 skipinions:hasFeature        1,010,604   1,010,604   1.019

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair contained 79 thousand members, versus 1.9 million for the largest predicate–object pair, given respectively by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000) we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes of size 996); the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around one billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).
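The cut-off itself can be derived directly from the class-size distribution shown in Fig. 9: accumulate the tuples produced by each class size (using the per-class formula from the figure caption) until the tuple budget would be exceeded. The following sketch illustrates the idea; the histogram input and the budget are parameters, and the function is an illustration rather than the exact code used in our system.

def tuples_for_class(size: int) -> int:
    """Raw concurrence tuples generated by a single class of the given size
    (the per-class contribution, following the formula in the Fig. 9 caption)."""
    return (size * size + size) // 2

def choose_cutoff(class_histogram, budget=1_000_000_000):
    """Return the largest class size whose inclusion keeps the cumulative
    number of materialised concurrence tuples within the given budget.
    class_histogram: dict mapping class size -> number of classes of that size."""
    total, cutoff = 0, 0
    for size in sorted(class_histogram):
        cost = class_histogram[size] * tuples_for_class(size)
        if total + cost > budget:
            break
        total += cost
        cutoff = size
    return cutoff, total

# e.g., choose_cutoff({2: 1_000_000, 3: 500_000, 40: 10_000}, budget=10_000_000)
# returns (3, 6_000_000): classes of size 40 would blow the budget.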

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although fewer inter-PLD concurrences are found, they typically have higher concurrence values.32

Table 15
Top five concurrent entities and the number of pairs they share.

  Entity label 1     Entity label 2     Concur.
1 New York City      New York State     791
2 London             England            894
3 Tokyo              Japan              900
4 Toronto            Ontario            418
5 Philadelphia       Pennsylvania       217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally that degree value.

  Ranked entity        Con.    "Closest" entity     Val.
1 Tim Berners-Lee      908     Lalana Kagal         0.83
2 Dan Brickley         2552    Libby Miller         0.94
3 update.status.net    11      socialnetwork.ro     0.45
4 FOAF-a-matic         21      foaf.me              0.23
5 Evan Prodromou       3367    Stav Prodromou       0.89

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both given as the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top five ranked entities; for brevity, we again show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.
33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) indicating how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; and (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are and are not consolidated in the previous consolidation step, we output sextuples of the form

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple, containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35
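For illustration, such a sextuple can be represented as a simple record; the identifiers and document URL below are hypothetical and merely show which positions carry canonical versus original terms (the syntactic shortcuts for s = s′ or o = o′ mentioned in the footnote are not modelled here).

from collections import namedtuple

# (s, p, o, c) is the consolidated quadruple with canonical identifiers;
# s_orig and o_orig preserve the identifiers used before consolidation.
Sextuple = namedtuple("Sextuple", ["s", "p", "o", "c", "s_orig", "o_orig"])

example = Sextuple(
    s="ex:e1",                        # canonical identifier chosen for the subject
    p="foaf:homepage",
    o="http://www.w3.org/",           # object position unchanged here
    c="http://example.org/doc.rdf",   # context: the source document
    s_orig="ex:TimBL",                # identifier appearing in the source document
    o_orig="http://www.w3.org/",
)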

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with a false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms;37 we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .
?x ?p ?l1 . ?x ?p ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where the first pattern constitutes the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dbo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dbo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dbo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dbo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use "male"/"female" as gender values and MM-DD syntax for birthdays, but without giving recommendations for datatypes or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
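The relaxed comparison of literal values can be sketched as follows; literals are modelled here as simple (lexical form, datatype, language) triples, and the full OWL value-space comparison of condition (i) is reduced to tuple equality for brevity, so this is an illustration of the heuristic rather than the system's actual implementation.

def lexical_key(lit):
    """Lower-cased lexical form of a literal, ignoring datatype and language tag.
    A literal is modelled here as a (lexical_form, datatype, language) triple."""
    lexical, _datatype, _language = lit
    return lexical.strip().lower()

def conflicting_values(lit1, lit2):
    """Relaxed check for two values of a functional datatype property:
    flag an inconsistency only if the literals are distinct *and* their
    lower-cased lexical forms differ (datatypes and language tags ignored)."""
    if lit1 == lit2:
        return False
    return lexical_key(lit1) != lexical_key(lit2)

# None of the ex:Ted pairs above is flagged:
assert not conflicting_values(("25", "xsd:integer", None), ("25", None, None))
assert not conflicting_values(("male", None, None), ("Male", None, "en"))
assert not conflicting_values(("25-05", "xsd:gMonthDay", None), ("25-05", None, None))
# The two lengths in the Fiat example are flagged:
assert conflicting_values(("3.546", "xsd:double", None), ("2.97", "xsd:double", None))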

With respect to these consistency-checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the statement

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type) and where the join variable is "intra-assertional" (i.e., appears twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw unconsolidated corpus.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is thus a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this could also be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation performs the repair of an equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
:made owl:propertyDisjointWith :maker .

Assertional [fictional]:
ex:AZ″ :maker ex:entcons′ .
dblp:Antoine_Zimmermann″ :made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); and (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation, by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject; loosely, d counts the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics. The value c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, …, Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq^0, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of the entities in Ei and Ej; intuitively, this is the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can then apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eq^n such that there is no pair {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs, such that neither Ei nor Ej has already been merged in this iteration, producing Eq^(n+1) at the iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence classes by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers,43 as sketched below.
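The following sketch illustrates the repair procedure under the stated assumptions: the non-transitive equivalence graph is given as edge weights (d, c), the incompatible pairs as a set, clusters are only merged when some direct evidence links them, and the merged concurrence score is approximated by a maximum (whereas the actual system recomputes it over the merged entities).

from itertools import combinations

def repair_equivalence_class(identifiers, edge_weights, incompatible):
    """Greedy consistent clustering of one equivalence class.
    identifiers: members of the (inconsistent) equivalence class.
    edge_weights: dict mapping frozenset({a, b}) -> (d, c), where d counts
        independent evidences for the equivalence of a and b, and c is their
        concurrence score (used only to break ties via tuple comparison).
    incompatible: set of frozenset({a, b}) pairs that must end up separated.
    Returns a list of repaired, mutually consistent partitions."""
    clusters = [frozenset([i]) for i in identifiers]

    def weight(e1, e2):
        # Aggregate (d, c) over all edges crossing e1 and e2, as if merged;
        # summing d approximates the number of unique evidences, and taking
        # the maximum c stands in for recomputing the merged concurrence score.
        cross = [edge_weights.get(frozenset({a, b}), (0, 0.0)) for a in e1 for b in e2]
        return (sum(d for d, _ in cross), max(c for _, c in cross))

    def mergeable(e1, e2):
        # Consistently mergeable: no incompatible pair spans the two clusters.
        return not any(frozenset({a, b}) in incompatible for a in e1 for b in e2)

    changed = True
    while changed:  # iterate until fixpoint
        changed = False
        candidates = sorted(
            ((weight(e1, e2), e1, e2)
             for e1, e2 in combinations(clusters, 2)
             if mergeable(e1, e2) and weight(e1, e2) > (0, 0.0)),
            key=lambda t: t[0], reverse=True)
        merged = set()
        for _, e1, e2 in candidates:
            if e1 in merged or e2 in merged:
                continue  # each cluster is merged at most once per iteration
            clusters.remove(e1)
            clusters.remove(e2)
            clusters.append(e1 | e2)
            merged.update({e1, e2})
            changed = True
    return clusters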

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respective consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all of the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data and batch consolidated segments of data.
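A sketch of this macro-level merge-join is given below, using in-memory iterators in place of the on-disk sorted files: the two sextuple streams, sorted by canonical subject and by canonical object respectively, are interleaved and grouped so that each batch contains the inlinks and outlinks of one canonical identifier.

import heapq
from itertools import groupby

def consolidated_batches(sorted_by_subject, sorted_by_object):
    """Stream batches of sextuples grouped by canonical identifier.
    sorted_by_subject yields (s, p, o, c, s_orig, o_orig) ordered by s;
    sorted_by_object yields the same sextuples ordered by o.
    Each batch contains every statement in which one canonical identifier
    appears as subject (outlinks) or as object (inlinks)."""
    keyed = heapq.merge(
        ((t[0], "out", t) for t in sorted_by_subject),
        ((t[2], "in", t) for t in sorted_by_object),
        key=lambda item: item[0])
    for canonical, group in groupby(keyed, key=lambda item: item[0]):
        outlinks, inlinks = [], []
        for _key, direction, t in group:
            (outlinks if direction == "out" else inlinks).append(t)
        yield canonical, outlinks, inlinks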

On the micro level, we buffer each individual consolidated segment into an in-memory index; currently these segments fit in memory, where for the largest equivalence classes we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at any given time. We assume access to the relevant terminological knowledge required for reasoning, and to the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of the terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data-preparation phase (although they are not required by that analysis).

Thus, we are left with two steps:

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

Table 17
Breakdown of timing of distributed disambiguation and repair; "1", "2", "4" and "8" denote the number of slave machines used, and "Full-8" denotes the full corpus over eight slave machines. Cells give minutes, with the percentage of total execution time in parentheses.

  Category                                     1               2              4              8              Full-8
Master
  Total execution time                         114.7 (100.0)   61.3 (100.0)   36.7 (100.0)   23.8 (100.0)   141.1 (100.0)
  Executing (miscellaneous)                    –               –              –              –              –
  Idle (waiting for slaves)                    114.7 (100.0)   61.2 (100.0)   36.6 (99.9)    23.8 (99.9)    141.1 (100.0)
Slave
  Avg. executing (total excl. idle)            114.7 (100.0)   59.0 (96.3)    33.6 (91.5)    22.3 (93.7)    130.9 (92.8)
    Identify inconsistencies and repairs       74.7 (65.1)     38.5 (62.8)    23.2 (63.4)    17.2 (72.2)    72.4 (51.3)
    Repair corpus                              40.0 (34.8)     20.5 (33.5)    10.3 (28.1)    5.1 (21.5)     58.5 (41.5)
  Avg. idle                                    0.0 (0.0)       2.3 (3.7)      3.1 (8.5)      1.5 (6.3)      10.2 (7.2)
    Waiting for peers                          0.0 (0.0)       2.2 (3.7)      3.1 (8.4)      1.5 (6.2)      10.1 (7.2)
    Waiting for master                         –               –              –              –              0.1 (0.1)

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%); in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards to the full corpus, note that the aggregation of the repair information took a negligible amount of time, with only a total of about one minute spent executing on the master machine. Most notably, load balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards to the timing of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing the number of machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection.

  Manual       Auto         Count   %
  SAME         /            223     44.3
  DIFFERENT    /            261     51.9
  UNCLEAR      –            19      3.8
  /            SAME         361     74.6
  /            DIFFERENT    123     25.4
  SAME         SAME         205     91.9
  SAME         DIFFERENT    18      8.1
  DIFFERENT    DIFFERENT    105     40.2
  DIFFERENT    SAME         156     59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

  Functional property                    Detections
1 foaf:gender                            60
2 foaf:age                               32
3 dbo:height, dbo:length                 4
4 dbo:wheelbase, dbo:width               3
5 atomowl:body                           1
6 loc:address, loc:name, loc:phone       1

Table 19
Breakdown of inconsistency detections for disjoint classes.

  Disjoint class 1      Disjoint class 2    Detections
1 foaf:Document         foaf:Person         163
2 foaf:Document         foaf:Agent          153
3 foaf:Organization     foaf:Person         17
4 foaf:Person           foaf:Project        3
5 foaf:Organization     foaf:Project        1

Fig. 10. Distribution of the number of partitions that the inconsistent equivalence classes are broken into (log/log); y-axis: number of repaired equivalence classes, x-axis: number of partitions.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); y-axis: number of partitions, x-axis: number of IDs in partition.

44 We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.


From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint-class pairs in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 223 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 261 times.44 We then checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 484 (clear) manually-inspected pairs, 361 (74.6%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but which were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and which were missed by our blacklist). This led to groups of users (the largest containing 205 users) being initially identified as coreferent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent; these can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help to automatically identify, in a granular manner, not only which resources are the same, but also which resources are different.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person/Document) which have wide catchments are useful for detecting inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide a critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance, given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to doing so would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain-agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting co-referent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name of instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.
48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and is then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to the one presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself, based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string-similarity measures and class-based machine-learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets, the interlink process materialises owl:sameAs relations between the two datasets, the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules), and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability: the RDF datasets are loaded into memory and processed using the popular Jena framework, and similarly the system proposes performing pair-wise comparison, e.g., using string-matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.

Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine co-referent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider the use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including the resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby co-referent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine the alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; and Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale/Web-data consolidation

Like us various authors have investigated consolidation tech-niques tailored to the challengesmdashesp scalability and noisemdashofresolving coreference in heterogeneous RDF Web data

On a larger scale and following a similar approach to us Huet al [3635] have recently investigated a consolidation systemmdashcalled ObjectCoRefmdashfor application over 500 million triples ofWeb data They too define a baseline consolidation (which they callthe kernel) built by applying reasoning on owlsameAsowlInverseFunctionalProperty owlFunctionalProper-

ty and owlcardinality No other OWL constructs or reasoningis used when building the kernel Then starting from the corefer-ents in the kernel they use a learning technique to find the mostdiscriminative properties Further they reuse the notion of average(inverse) cardinality [34] to find frequent property combinationpairs A small scale experiment is conducted that shows good re-sults At large scale although they manage to build the kernel forthe 500 million triples dataset they only apply and evaluate theirstatistical method on a very small subset confessing that the ap-proach would not be feasible for the complete set

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper; however, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
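To give a flavour of such LSH-style grouping, the following minimal Java sketch (our own illustration, not the RDFSim implementation; all names are hypothetical) builds MinHash signatures over token sets, whose positional agreement approximates the Jaccard coefficient mentioned above.

import java.util.*;

// Illustrative MinHash sketch (not the RDFSim implementation): signature
// agreement between two token sets approximates their Jaccard coefficient,
// so lexically similar "virtual documents" can be grouped cheaply.
public class MinHashSketch {
  static final int NUM_HASHES = 64;
  static final long[] SEEDS = new long[NUM_HASHES];
  static { Random r = new Random(42); for (int i = 0; i < NUM_HASHES; i++) SEEDS[i] = r.nextLong(); }

  // one signature position per hash function: the minimum hash over all tokens
  static long[] signature(Set<String> tokens) {
    long[] sig = new long[NUM_HASHES];
    Arrays.fill(sig, Long.MAX_VALUE);
    for (String t : tokens)
      for (int i = 0; i < NUM_HASHES; i++) {
        long h = (t.hashCode() ^ SEEDS[i]) * 0x9E3779B97F4A7C15L; // simple seeded mix
        if (h < sig[i]) sig[i] = h;
      }
    return sig;
  }

  // fraction of matching positions estimates the Jaccard coefficient of the two sets
  static double estimatedJaccard(long[] a, long[] b) {
    int eq = 0;
    for (int i = 0; i < NUM_HASHES; i++) if (a[i] == b[i]) eq++;
    return eq / (double) NUM_HASHES;
  }
}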

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was of size five thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper, mapping each prefix to its namespace URI: the "T-Box" prefixes are atomowl, b2r, b2rr, dbo, dbp, ecs, eurostat, fb, foaf, geonames, kwa, lldpubmed, loc, mo, mvcb, opiumfield, owl, quaffing, rdf, rdfs, skipinions and skos; the "A-Box" prefixes are avtimbl, dbpedia, dblpperson, eswc2006p, identicau, kingdoms, macs, semweborg, swid, timblfoaf, vperson, wikier and yagor.

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/.

[4] Tim Berners-Lee, Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/.

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/.

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/.

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer? in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.

[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Table 2
OWL 2 RL/RDF rules that directly produce owl:sameAs relations; we denote authoritative variables with bold and italicise the labels of rules requiring new OWL 2 constructs.

OWL2RL | Antecedent (terminological) | Antecedent (assertional) | Consequent
prp-fp | ?p a owl:FunctionalProperty | ?x ?p ?y1 , ?y2 | ?y1 owl:sameAs ?y2
prp-ifp | ?p a owl:InverseFunctionalProperty | ?x1 ?p ?y . ?x2 ?p ?y | ?x1 owl:sameAs ?x2
cls-maxc2 | ?x owl:maxCardinality 1 . ?x owl:onProperty ?p | ?u a ?x . ?u ?p ?y1 , ?y2 | ?y1 owl:sameAs ?y2
cls-maxqc4 | ?x owl:maxQualifiedCardinality 1 . ?x owl:onProperty ?p . ?x owl:onClass owl:Thing | ?u a ?x . ?u ?p ?y1 , ?y2 | ?y1 owl:sameAs ?y2

Table 3
OWL 2 RL/RDF rules containing precisely one assertional pattern in the body, with authoritative variables in bold (cf. [33]).

OWL2RL | Antecedent (terminological) | Antecedent (assertional) | Consequent
prp-dom | ?p rdfs:domain ?c | ?x ?p ?y | ?x a ?c
prp-rng | ?p rdfs:range ?c | ?x ?p ?y | ?y a ?c
prp-symp | ?p a owl:SymmetricProperty | ?x ?p ?y | ?y ?p ?x
prp-spo1 | ?p1 rdfs:subPropertyOf ?p2 | ?x ?p1 ?y | ?x ?p2 ?y
prp-eqp1 | ?p1 owl:equivalentProperty ?p2 | ?x ?p1 ?y | ?x ?p2 ?y
prp-eqp2 | ?p1 owl:equivalentProperty ?p2 | ?x ?p2 ?y | ?x ?p1 ?y
prp-inv1 | ?p1 owl:inverseOf ?p2 | ?x ?p1 ?y | ?y ?p2 ?x
prp-inv2 | ?p1 owl:inverseOf ?p2 | ?x ?p2 ?y | ?y ?p1 ?x
cls-int2 | ?c owl:intersectionOf (?c1 ... ?cn) | ?x a ?c | ?x a ?c1 , ... , ?cn
cls-uni | ?c owl:unionOf (?c1 ... ?ci ... ?cn) | ?x a ?ci | ?x a ?c
cls-svf2 | ?x owl:someValuesFrom owl:Thing . ?x owl:onProperty ?p | ?u ?p ?v | ?u a ?x
cls-hv1 | ?x owl:hasValue ?y . ?x owl:onProperty ?p | ?u a ?x | ?u ?p ?y
cls-hv2 | ?x owl:hasValue ?y . ?x owl:onProperty ?p | ?u ?p ?y | ?u a ?x
cax-sco | ?c1 rdfs:subClassOf ?c2 | ?x a ?c1 | ?x a ?c2
cax-eqc1 | ?c1 owl:equivalentClass ?c2 | ?x a ?c1 | ?x a ?c2
cax-eqc2 | ?c1 owl:equivalentClass ?c2 | ?x a ?c2 | ?x a ?c1

Table 4
OWL 2 RL/RDF rules used to detect inconsistency that we currently support; we denote authoritative variables with bold.

OWL2RL | Antecedent (terminological) | Antecedent (assertional)
eq-diff1 | – | ?x owl:sameAs ?y . ?x owl:differentFrom ?y
prp-irp | ?p a owl:IrreflexiveProperty | ?x ?p ?x
prp-asyp | ?p a owl:AsymmetricProperty | ?x ?p ?y . ?y ?p ?x
prp-pdw | ?p1 owl:propertyDisjointWith ?p2 | ?x ?p1 ?y . ?x ?p2 ?y
prp-adp | ?x a owl:AllDisjointProperties . ?x owl:members (?p1 ... ?pn) | ?u ?pi ?y . ?u ?pj ?y (i ≠ j)
cls-com | ?c1 owl:complementOf ?c2 | ?x a ?c1 , ?c2
cls-maxc1 | ?x owl:maxCardinality 0 . ?x owl:onProperty ?p | ?u a ?x . ?u ?p ?y
cls-maxqc2 | ?x owl:maxQualifiedCardinality 0 . ?x owl:onProperty ?p . ?x owl:onClass owl:Thing | ?u a ?x . ?u ?p ?y
cax-dw | ?c1 owl:disjointWith ?c2 | ?x a ?c1 , ?c2
cax-adc | ?x a owl:AllDisjointClasses . ?x owl:members (?c1 ... ?cn) | ?z a ?ci , ?cj (i ≠ j)


are sound (i.e., correct) with respect to formal OWL semantics; such noise in our input data may lead to unintended consequences when applying reasoning, which we wish to minimise and/or repair. Relatedly, OWL contains features that allow for detecting formal contradictions, called inconsistencies, in RDF data. An example of an inconsistency would be where something is a member of two disjoint classes, such as foaf:Person and foaf:Organization: formally, the intersection of such disjoint classes should be empty, and they should not share members. Along these lines, OWL 2 RL/RDF contains rules (with the special symbol false in the head) that allow for detecting inconsistency in an RDF graph. We use a subset of these consistency-checking OWL 2 RL/RDF rules later in Section 7 to try to automatically detect and repair unintended owl:sameAs inferences; the supported subset is listed in Table 4.
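As a concrete illustration of one such consistency-checking rule (cax-dw in Table 4), the following minimal Java sketch, with a toy triple representation and hypothetical example data, flags any resource typed with two classes that are declared disjoint; it is an illustration only, not the distributed implementation described later.

import java.util.*;

// Toy illustration of rule cax-dw: if ?c1 owl:disjointWith ?c2 and some ?x is
// typed with both ?c1 and ?c2, then ?x witnesses an inconsistency.
// Triples are String[]{subject, predicate, object}; prefixes are abbreviated.
public class DisjointnessCheck {
  static final String TYPE = "rdf:type", DISJOINT = "owl:disjointWith";

  static Set<String> inconsistentMembers(List<String[]> triples) {
    Set<String> disjoint = new HashSet<>();              // "c1|c2" for each disjointness axiom
    Map<String, Set<String>> typesOf = new HashMap<>();  // instance -> asserted classes
    for (String[] t : triples) {
      if (DISJOINT.equals(t[1])) { disjoint.add(t[0] + "|" + t[2]); disjoint.add(t[2] + "|" + t[0]); }
      else if (TYPE.equals(t[1])) typesOf.computeIfAbsent(t[0], k -> new HashSet<>()).add(t[2]);
    }
    Set<String> witnesses = new LinkedHashSet<>();
    for (Map.Entry<String, Set<String>> e : typesOf.entrySet())
      for (String c1 : e.getValue())
        for (String c2 : e.getValue())
          if (disjoint.contains(c1 + "|" + c2)) witnesses.add(e.getKey());
    return witnesses;
  }

  public static void main(String[] args) {
    List<String[]> data = Arrays.asList(
        new String[]{"foaf:Person", DISJOINT, "foaf:Organization"},
        new String[]{"ex:acme", TYPE, "foaf:Person"},
        new String[]{"ex:acme", TYPE, "foaf:Organization"});
    System.out.println(inconsistentMembers(data)); // prints [ex:acme]
  }
}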

2.7. Distribution architecture

To help meet our scalability requirements, our methods are implemented on a shared-nothing distributed architecture [64] over a cluster of commodity hardware. The distributed framework consists of a master machine that orchestrates the given tasks, and several slave machines that perform parts of the task in parallel.

The master machine can instigate the following distributed operations (a minimal interface sketch is given after the list):

– Scatter: partition on-disk data using some local split function, and send each chunk to individual slave machines for subsequent processing.

– Run: request the parallel execution of a task by the slave machines; such a task either involves processing of some data local to the slave machine, or the coordinate method (described later) for reorganising the data under analysis.

– Gather: gathers chunks of output data from the slave swarm and performs some local merge function over the data.

– Flood: broadcast global knowledge required by all slave machines for a future task.
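The following hypothetical Java interface sketches how these four master-side operations might be exposed; the names and signatures below are ours and purely illustrative (the actual system is built directly on Java RMI, as described in Section 2.8), intended only to make the scatter/run/gather/flood pattern concrete.

import java.io.File;
import java.util.List;

// Hypothetical sketch of the master-side operations; not the actual code-base.
public interface MasterOperations {
  // partition on-disk data with a split function and send one chunk per slave
  void scatter(File input, SplitFunction split, List<SlaveHandle> slaves);

  // ask every slave to execute a named task in parallel over its local data
  void run(String taskName, List<SlaveHandle> slaves);

  // collect output chunks from the slaves and merge them locally
  File gather(MergeFunction merge, List<SlaveHandle> slaves);

  // broadcast globally required knowledge (e.g., an owl:sameAs index) to all slaves
  void flood(File globalKnowledge, List<SlaveHandle> slaves);

  interface SplitFunction { int chunkFor(String record, int numSlaves); }
  interface MergeFunction { void merge(List<File> chunks, File out); }
  interface SlaveHandle { /* remote reference to a slave machine */ }
}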


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine are partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel, using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all intermediary data through one machine.

Note that herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications: that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle, waiting for the slaves, and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel; although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which it should be performed, and the task-scheduling possible. The distributed implementations of our tasks are designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid, insofar as is reasonable, duplicating computation.

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

9 http://openhms.sourceforge.net/rmiio/

10 We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications: viz., 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka PLDs; domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count per each PLD queue were polled first.
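The per-PLD polling discipline just described can be pictured with the following rough Java sketch; this is not the actual crawler code (which is detailed in [32,29]) and all names are hypothetical, but it illustrates one queue per pay-level-domain ordered by inlink count, with a minimum delay of 500 ms between polls of the same PLD (i.e., at most two URIs per PLD per second).

import java.util.*;

// Rough, illustrative sketch of a polite per-PLD frontier (not the actual crawler).
public class PoliteFrontier {
  static final long MIN_DELAY_MS = 500;               // at most two URIs per PLD per second
  final Map<String, PriorityQueue<ScoredUri>> queues = new HashMap<>();
  final Map<String, Long> lastPolled = new HashMap<>();

  static class ScoredUri {
    final String uri; final int inlinks;
    ScoredUri(String uri, int inlinks) { this.uri = uri; this.inlinks = inlinks; }
  }

  // enqueue a URI under its PLD, ordered by descending inlink count
  void add(String pld, String uri, int inlinks) {
    queues.computeIfAbsent(pld, k -> new PriorityQueue<>((a, b) -> b.inlinks - a.inlinks))
          .add(new ScoredUri(uri, inlinks));
  }

  // returns the next URI that may politely be fetched, or null if all PLDs must wait
  String poll() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, PriorityQueue<ScoredUri>> e : queues.entrySet()) {
      long last = lastPolled.getOrDefault(e.getKey(), 0L);
      if (!e.getValue().isEmpty() && now - last >= MIN_DELAY_MS) {
        lastPolled.put(e.getKey(), now);
        return e.getValue().poll().uri;
      }
    }
    return null;
  }
}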

With regards to the resulting corpus: of the 1.118 billion quads, 1.106 billion are unique, and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now, we discuss the usage of terms in a data-level position: viz., terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs, and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred, on average, 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position; 19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs; 56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident; again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position).

Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position).

– each URI is re-used on average in 4.7 documents, but usually only within the same domain; most external re-use is in the object position of a triple;

– 99% of URIs appear in only one PLD.

We can conclude that, within our corpus (itself a general crawl for RDF/XML on the Web), we find that there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource" [7].

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus, provided directly by the Linked Data publishers themselves, that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps each identifier to the equivalence class in which it is contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier for each equivalence class: we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;
(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;
(iii) for each equivalence class in the index, choose a canonical term;
(iv) scan the corpus again, canonicalising any term in the subject position or in the object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus (the owl:sameAs statements), and can apply consolidation by means of two scans.
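For illustration, a simplified Java sketch of the second (canonicalisation) scan is given below; it assumes quads serialised one per line as tab-separated terms and an in-memory map canon from identifiers to the canonical term of their equivalence class. The assumed serialisation, method names and omission of GZip handling are ours, not details of the actual implementation.

import java.io.*;
import java.util.Map;

// Illustrative sketch of the canonicalisation scan: rewrite the subject of every
// quad, and the object of every non-rdf:type quad, to its canonical identifier.
public class CanonicalisationScan {
  static final String RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>";

  static void rewrite(BufferedReader in, Writer out, Map<String, String> canon) throws IOException {
    String line;
    while ((line = in.readLine()) != null) {
      String[] q = line.split("\t");                   // s, p, o, context
      q[0] = canon.getOrDefault(q[0], q[0]);           // rewrite subject
      if (!RDF_TYPE.equals(q[1]))                      // leave class terms untouched
        q[2] = canon.getOrDefault(q[2], q[2]);         // rewrite object
      out.write(String.join("\t", q));
      out.write('\n');
    }
  }
}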

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union-find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index, and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;

(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);

(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing a term with the lowest alphabetical ordering; a canonical term is thus associated to each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
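A minimal sketch of this canonical-term selection is given below; it assumes blank nodes are serialised with a "_:" prefix (an assumption of the sketch, not a statement about the system's internal encoding).

import java.util.Collection;

// Minimal sketch of canonical-term selection for one equivalence class:
// prefer URIs over blank nodes, breaking ties by the lexically lowest term.
public class CanonicalTerm {
  static String pick(Collection<String> equivalenceClass) {
    String best = null;
    for (String term : equivalenceClass) {
      if (best == null) { best = term; continue; }
      boolean termIsUri = !term.startsWith("_:"), bestIsUri = !best.startsWith("_:");
      if ((termIsUri && !bestIsUri) || (termIsUri == bestIsUri && term.compareTo(best) < 0))
        best = term;
    }
    return best;
  }
}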

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations.

(ii) Gather: gather all owl:sameAs relations onto the master machine, and build the in-memory equality index.

(iii) Flood/run: send the equality index (in its entirety) to each slave machine, and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods (involving the two scans of the main corpus) can be conducted in parallel.

Algorithm 1

Building equivalence map [32]

Require: SAMEAS DATA: SA
1: map ← {}
2: for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
3:   eqcs ← map.get(s)
4:   eqco ← map.get(o)
5:   if eqcs = ∅ ∧ eqco = ∅ then
6:     eqcs∪o ← {s, o}
7:     map.put(s, eqcs∪o)
8:     map.put(o, eqcs∪o)
9:   else if eqcs = ∅ then
10:    add s to eqco
11:    map.put(s, eqco)
12:  else if eqco = ∅ then
13:    add o to eqcs
14:    map.put(o, eqcs)
15:  else if eqcs ≠ eqco then
16:    if |eqcs| > |eqco| then
17:      add all eqco into eqcs
18:      for eo ∈ eqco do
19:        map.put(eo, eqcs)
20:      end for
21:    else
22:      add all eqcs into eqco
23:      for es ∈ eqcs do
24:        map.put(es, eqco)
25:      end for
26:    end if
27:  end if
28: end for
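The following compact Java sketch mirrors the merge strategy of Algorithm 1; it is our own rendering for illustration, not the production code. Each identifier maps to the mutable set of its coreferent identifiers, and when an owl:sameAs statement bridges two existing sets, the smaller set is folded into the larger one, as in the final branch of the algorithm.

import java.util.*;

// Sketch of the in-memory equivalence-map construction of Algorithm 1.
public class EquivalenceMap {
  final Map<String, Set<String>> map = new HashMap<>();

  void addSameAs(String s, String o) {
    if (s.equals(o)) return;                          // skip reflexive statements
    Set<String> eqS = map.get(s), eqO = map.get(o);
    if (eqS == null && eqO == null) {                 // neither term seen before
      Set<String> fresh = new HashSet<>(Arrays.asList(s, o));
      map.put(s, fresh); map.put(o, fresh);
    } else if (eqS == null) {                         // only o has a class
      eqO.add(s); map.put(s, eqO);
    } else if (eqO == null) {                         // only s has a class
      eqS.add(o); map.put(o, eqS);
    } else if (eqS != eqO) {                          // merge two existing classes
      Set<String> big = eqS.size() >= eqO.size() ? eqS : eqO;
      Set<String> small = (big == eqS) ? eqO : eqS;
      big.addAll(small);
      for (String member : small) map.put(member, big);
    }
  }
}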

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%), i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1/2/4/8) over the 100 million quadruple corpus.

Table 5
Breakdown of timing of distributed baseline consolidation (each cell gives min / %).

Category | 1 | 2 | 4 | 8 | Full-8
Master:
Total execution time | 42.2 / 100.0 | 22.3 / 100.0 | 11.9 / 100.0 | 6.7 / 100.0 | 63.3 / 100.0
Executing | 0.5 / 1.1 | 0.5 / 2.2 | 0.5 / 4.0 | 0.5 / 7.5 | 8.5 / 13.4
Aggregating owl:sameAs | 0.4 / 1.1 | 0.5 / 2.2 | 0.5 / 4.0 | 0.5 / 7.5 | 8.4 / 13.3
Miscellaneous | 0.1 / 0.1 | 0.1 / 0.4 | ∼ / ∼ | ∼ / ∼ | 0.1 / 0.1
Idle (waiting for slaves) | 41.8 / 98.9 | 21.8 / 97.7 | 11.4 / 95.8 | 6.2 / 92.1 | 54.8 / 86.6
Slaves:
Avg. Executing (total) | 41.8 / 98.9 | 21.8 / 97.7 | 11.4 / 95.8 | 6.2 / 92.1 | 53.5 / 84.6
Extract owl:sameAs | 9.0 / 21.4 | 4.5 / 20.2 | 2.2 / 18.4 | 1.1 / 16.8 | 12.3 / 19.5
Consolidate | 32.7 / 77.5 | 17.1 / 76.9 | 8.8 / 73.5 | 4.7 / 70.5 | 41.2 / 65.1
Avg. Idle | 0.5 / 1.1 | 0.7 / 2.9 | 1.0 / 8.2 | 0.8 / 12.7 | 9.8 / 15.4
Waiting for peers | 0.0 / 0.0 | 0.1 / 0.7 | 0.5 / 4.0 | 0.3 / 4.7 | 1.3 / 2.0
Waiting for master | 0.5 / 1.1 | 0.5 / 2.3 | 0.5 / 4.2 | 0.5 / 7.9 | 8.5 / 13.4

Fig. 3. Distribution of sizes of equivalence classes, on log/log scale (from [32]).

Fig. 4. Distribution of the number of PLDs per equivalence class, on log/log scale.

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e, which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


Note that, in the table, we use the symbol '∼' to indicate a negligible value (greater than zero but less than 0.05). We see that, as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5× would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs), an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities, and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp. Dieter Fensel and Rudi Studer: authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results.

Table 6
Largest five equivalence classes (from [32]).

# | Canonical term (lexically first in equivalence class) | Size | Correct
1 | http://bio2rdf.org/dailymed_drugs:1000 | 8,481 | –
2 | http://bio2rdf.org/dailymed_drugs:1042 | 800 | –
3 | http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e | 443 | ✓
4 | http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1 | 353 | ✓
5 | http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9 | 316 | ✓


For the sampling, we take the closed equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision, or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type-object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7: Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

  #   PLD             PLD                      Co-occur
  1   rdfize.com      uriburner.com            506623
  2   dbpedia.org     freebase.com             187054
  3   bio2rdf.org     purl.org                 185392
  4   loc.gov         info:lc/authorities (a)  166650
  5   l3s.de          rkbexplorer.com          125842
  6   bibsonomy.org   l3s.de                   125832
  7   bibsonomy.org   rkbexplorer.com          125830
  8   dbpedia.org     mpii.de                  99827
  9   freebase.com    mpii.de                  89610
  10  dbpedia.org     umbel.org                65565

  (a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities; in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features—including functional properties, inverse-functional properties and certain cardinality restrictions—that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes—such as ISBNs for books, EAN·UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc.—to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, Wikipedia articles, etc.

Although Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to the methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive/symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations; the scalability of the baseline approach is limited by the machine with the least in-memory capacity.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data—hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals—we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply the general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;
(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

In Step (i), we extract terminological data—required for application of our rules—from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern and thus are amenable to execution by a single scan; e.g., given the statement

  homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate a T-ground rule

  ?x homepage ?y . ⇒ ?x isPrimaryTopicOf ?y .

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

  ?x isPrimaryTopicOf ?y .

which may indicate data useful for consolidation (note that here isPrimaryTopicOf is assumed to be inverse-functional).
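To make the single-scan application of T-ground rules concrete, the following is a minimal sketch (not the system's actual implementation): it assumes a toy in-memory corpus, illustrative prefixed names such as foaf:homepage and foaf:isPrimaryTopicOf, and hypothetical helper structures; a subPropertyOf axiom is compiled into a one-pattern rule and applied in one pass, while consolidation-relevant statements are merely buffered for a later join.

```python
# Minimal sketch: T-grounding a subPropertyOf axiom and applying it in one scan.
# Prefixes, predicate names and the tiny in-memory "corpus" are illustrative assumptions.

RDFS_SUBPROP = "rdfs:subPropertyOf"

# Authoritative terminological statement gathered in Step (i)/(ii):
terminology = [("foaf:homepage", RDFS_SUBPROP, "foaf:isPrimaryTopicOf")]

# Compile T-ground rules: one assertional pattern in the body, one in the head.
tground_rules = {sub: obj for (sub, p, obj) in terminology if p == RDFS_SUBPROP}

# Assertional patterns useful for consolidation (e.g., inverse-functional predicates).
consolidation_predicates = {"foaf:isPrimaryTopicOf"}

corpus = [
    ("dblp:Axel", "foaf:homepage", "<http://polleres.net>"),
]

consolidation_buffer = []          # statements buffered for later merge-joins
for (s, p, o) in corpus:           # Step (iii): single scan over the input
    inferred = []
    if p in tground_rules:         # apply the T-ground rule (?x p ?y) => (?x p' ?y)
        inferred.append((s, tground_rules[p], o))
    for triple in [(s, p, o)] + inferred:
        if triple[1] in consolidation_predicates or triple[1] == "owl:sameAs":
            consolidation_buffer.append(triple)

print(consolidation_buffer)
# [('dblp:Axel', 'foaf:isPrimaryTopicOf', '<http://polleres.net>')]
```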

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern—including any owl:sameAs statements found—are buffered to a file for later join computation.

Subsequently, in Step (iv), we must now compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives (a minimal sketch of these primitives follows the list):

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.
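The following is a minimal sketch of the external-sort primitive only, under simplifying assumptions: plain newline-terminated text lines stand in for the compressed tuple files used in practice, and file handling is deliberately naive.

```python
# Minimal sketch of the external-sort primitive: sort fixed-size batches in memory,
# spill each sorted batch to disk, then merge the sorted runs in one interleaved pass.
# Input lines are assumed to be newline-terminated strings.
import heapq
import tempfile

def _write_run(sorted_batch):
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    f.writelines(sorted_batch)
    f.close()
    return f.name

def external_sort(lines, batch_size=100000):
    runs, batch = [], []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:          # spill a sorted batch to disk
            runs.append(_write_run(sorted(batch)))
            batch = []
    if batch:
        runs.append(_write_run(sorted(batch)))
    files = [open(path) for path in runs]
    # heapq.merge scans the sorted runs in an interleaving manner (merge-sort)
    yield from heapq.merge(*files)

# Hypothetical usage: sorted_lines = list(external_sort(open("quads.nt")))
```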

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.

18 These files are G-Zip compressed flat files of N-Triple-like syntax, encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.
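As an illustration of the inverse-functional-property case, the sketch below uses an in-memory groupby over (predicate, object, subject) tuples as a stand-in for the on-disk merge-join over sorted files; the membership data and the choice of the lexically lowest subject as pivot are illustrative assumptions.

```python
# Minimal sketch of inverse-functional-property reasoning over sorted data:
# membership statements are assumed sorted by (predicate, object), so all subjects
# sharing an IFP value arrive together and can be joined in memory per group.
from itertools import groupby

def ifp_same_as(sorted_pos_tuples):
    """sorted_pos_tuples: iterable of (p, o, s) tuples, sorted by (p, o)."""
    for (_p, _o), group in groupby(sorted_pos_tuples, key=lambda t: (t[0], t[1])):
        subjects = sorted({s for (_, _, s) in group})
        canonical = subjects[0]                    # lexically lowest subject as pivot
        for other in subjects[1:]:
            yield (other, "owl:sameAs", canonical)

memberships = sorted([
    ("foaf:mbox_sha1sum", "ab12...", "ex:alice"),   # illustrative hash value
    ("foaf:mbox_sha1sum", "ab12...", "dblp:Alice"),
])
print(list(ifp_same_as(memberships)))
# [('ex:alice', 'owl:sameAs', 'dblp:Alice')]
```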

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method, again leveraging external sorts and merge-joins to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, and a, b, c, etc. to denote members of U ∪ B such that a < b < c, and we define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;
(ii) transitivity is supported by iterative merge-join scans:
  – in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
  – at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;
  – the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
  – inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;
(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;
(iv) the process of reaching a fixpoint is accelerated using available main memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
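For intuition only, the sketch below uses an in-memory union-find—a different, swapped-in technique—to compute the same canonicalised symmetric/transitive closure that the on-disk sort/merge-join iterations produce at scale; it assumes plain lexical ordering (the paper's method additionally orders URIs before blank nodes), and the function name is illustrative.

```python
# Illustrative only: an in-memory union-find computing a canonicalised
# symmetric/transitive closure, with the lexically lowest term of each
# equivalence class used as the canonical identifier.
def canonical_closure(same_as_pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path compression
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            lo, hi = min(ra, rb), max(ra, rb)  # keep the lexically lowest root
            parent[hi] = lo

    # non-symmetric output: each non-canonical term maps to its canonical term
    return {x: find(x) for x in parent if find(x) != x}

print(canonical_closure([("b", "a"), ("b", "c")]))   # {'b': 'a', 'c': 'a'}
```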

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20

20 One could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.
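A minimal sketch of rewriting a single position by merge-join is given below; it assumes both inputs are already sorted on the rewrite key and that the owl:sameAs index maps non-canonical terms to canonical ones, as described above. Names and example data are illustrative.

```python
# Minimal sketch of rewriting one tuple position against the closed owl:sameAs
# index: both inputs are assumed sorted on the rewrite key, so a single
# interleaved pass suffices (no random access).
def rewrite_position(sorted_tuples, sorted_same_as, pos=0):
    """sorted_tuples: tuples sorted by element `pos`;
       sorted_same_as: (non_canonical, canonical) pairs sorted by non_canonical."""
    sa = iter(sorted_same_as)
    current = next(sa, None)
    for t in sorted_tuples:
        key = t[pos]
        while current is not None and current[0] < key:
            current = next(sa, None)
        if current is not None and current[0] == key:
            t = t[:pos] + (current[1],) + t[pos + 1:]    # rewrite to the canonical term
        yield t

quads = sorted([("ex:b", "foaf:name", '"Alice"', "doc1")])
index = sorted([("ex:b", "ex:a")])
print(list(rewrite_position(quads, index)))
# [('ex:a', 'foaf:name', '"Alice"', 'doc1')]
```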

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

  dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
  axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
  <http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs, we require a higher-level iterative process, as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;
(ii) apply the initial closure of the owl:sameAs data;
(iii) then, iteratively until no new owl:sameAs inferences are found:
  – rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
  – derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;
  – re-derive the closure of the owl:sameAs data including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;
(ii) Gather: gather the terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;
(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;
(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;
(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:


Table 8: Breakdown of timing of distributed extended consolidation with reasoning, where † identifies the master/slave tasks that run concurrently. Columns give minutes and (% of total) for 1, 2, 4 and 8 slave machines, and for the full corpus with 8 slaves (Full-8).

  Category                                      1            2            4            8            Full-8
                                                min    %     min    %     min    %     min    %     min    %
  Master
    Total execution time                        454.3 100.0  230.2 100.0  127.5 100.0   81.3 100.0  740.4 100.0
    Executing                                    29.9   6.6   30.3  13.1   32.5  25.5   30.1  37.0  243.6  32.9
      Aggregate consolidation-relevant data       4.3   0.9    4.5   2.0    4.7   3.7    4.6   5.7   29.9   4.0
      Closing owl:sameAs †                       24.4   5.4   24.5  10.6   25.4  19.9   24.4  30.0  211.2  28.5
      Miscellaneous                               1.2   0.3    1.3   0.5    2.4   1.9    1.1   1.3    2.5   0.3
    Idle (waiting for slaves)                   424.5  93.4  199.9  86.9   95.0  74.5   51.2  63.0  496.8  67.1
  Slave
    Avg. executing (total)                      448.9  98.8  221.2  96.1  111.6  87.5   56.5  69.4  599.1  80.9
      Extract terminology                        44.4   9.8   22.7   9.9   11.8   9.2    6.9   8.4   49.4   6.7
      Extract consolidation-relevant data        95.5  21.0   48.3  21.0   24.3  19.1   13.2  16.2  137.9  18.6
      Initial sort (by subject) †                98.0  21.6   48.3  21.0   23.7  18.6   11.6  14.2  133.8  18.1
      Consolidation                             211.0  46.4  101.9  44.3   51.9  40.7   24.9  30.6  278.0  37.5
    Avg. idle                                     5.5   1.2    8.9   3.9   37.9  29.7   23.6  29.0  141.3  19.1
      Waiting for peers                           0.0   0.0    3.2   1.4    7.1   5.6    6.3   7.7   37.5   5.1
      Waiting for master                          5.5   1.2    5.8   2.5   30.8  24.1   17.3  21.2  103.8  14.0


  – Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;
  – Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;
(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible—the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time—of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup—in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regard to performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements—this time directly extracted from explicit statements, as opposed to explicit and inferred statements—which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9: Top ten most frequently occurring blacklisted values.

  #   Blacklisted term                                    Occurrences
  1   Empty literals                                      584735
  2   <http://null>                                       414088
  3   <http://www.vox.com/gone>                           150402
  4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"          58338
  5   <http://www.facebook.com>                           6988
  6   <http://facebook.com>                               5462
  7   <http://www.google.com>                             2234
  8   <http://www.facebook.com/>                          1254
  9   <http://google.com>                                 1108
  10  <http://null.com>                                   542

[Fig. 5: Distribution of the number of identifiers per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log); x-axis: equivalence class size; y-axis: number of classes.]


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, when applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 37.7 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted—however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments for older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9—this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23

During the computation of the owl:sameAs closure, we found zero inferences through the cardinality rules, 1.068 million raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2, and s2 is a canonical identifier) contained 12.03 million statements.

22 cf. http://bio2rdf.org/ns/bio2rdf:Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline—5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× from baseline—5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline—6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user—labelled Team Vox—commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford—labelled IBRG—where again it is identified in thousands of documents using different blank-nodes but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME 823 times (82.3%), TRIVIALLY SAME 145 times (14.5%), DIFFERENT 23 times (2.3%) and UNCLEAR 9 times (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Table 10: Largest five equivalence classes after extended consolidation.

  #  Canonical term (lexically lowest in equivalence class)              Size    Correct
  1  bnode37@http://a12iggymom.vox.com/profile/foaf.rdf                  33052   ✓
  2  http://bio2rdf.org/dailymed_drugs:1000                              8481    –
  3  http://ajft.org/rdf/foaf.rdf#me                                     8140    ✓
  4  bnode4@http://174.129.12.140:8080/tcm/data/association/100          4395    ✓
  5  bnode1@http://aaa977.vox.com/profile/foaf.rdf                       1977    ✓

[Fig. 6: Distribution of the number of PLDs per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log); x-axis: number of PLDs in equivalence class; y-axis: number of classes.]

Table 11: Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

  #  Canonical identifier                               BL   R
  1  <http://dblp.l3s.de/Tim_Berners-Lee>               26   50
  2  <genid:danbri>                                     10   138
  3  <http://update.status.net/>                        0    0
  4  <http://www.ldodds.com/foaf/foaf-a-matic>          0    0
  5  <http://update.status.net/user/1#acct>             0    6

[Fig. 7: Distribution of the number of PLDs that terms are referenced by, for the raw, baseline-consolidated and reasoning-consolidated data (log/log); x-axis: number of PLDs mentioning a blank-node/URI term; y-axis: number of terms.]


Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand; 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million; 56.4%). The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley"—one of the founders and leading contributors of the FOAF project—with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

25 We make these and later results available for review at http://swse.deri.org/entity.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in the re-use of identifiers across PLDs, we note that: (i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity—in terms of unique PLDs—was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

26 Domenico Gendarmi, with three URIs—one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities—with higher concurrence scores—are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;
– formulated an initial small-scale (5.6 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities—if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods, implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);
(ii) the concurrence measure should give a higher weight to exclusive shared pairs—pairs that are typically shared by few entities; for edges (predicates) that typically have a low in-degree/out-degree;
(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities—a shared pair cannot reduce the measured concurrence of the sharing entities;
(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;
(v) correlation may exist between shared pairs—e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers)—where we wish to dampen the cumulative effect of correlation in the concurrence analysis;
(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


  dblp:AliceB10 foaf:maker ex:Alice .
  dblp:AliceB10 foaf:maker ex:Bob .

  ex:Alice foaf:gender "female" .
  ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .

  ex:Bob foaf:gender "male" .
  ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .

  ex:Claire foaf:gender "female" .
  ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p,s), is the cardinality of the set {o ∈ C | (s,p,o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p,o), is the cardinality of the set {s ∈ U ∪ B | (s,p,o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_{G_ex}(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph—e.g., ICard_{G_ex}(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores—note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality)—we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

  AIC_G(p) = |{(s,o) | (s,p,o) ∈ G}| / |{o | ∃s : (s,p,o) ∈ G}|

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater than or equal to one: the numerator is by definition greater than or equal to the denominator. Taking an example, AIC_{G_ex}(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties—in their respective directions—as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values for properties with a low number of observations towards a global mean—consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula, as follows, to weight properties with few observations towards a global mean.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

  AAIC_G(p) = ( AIC_G(p) · |G_p| + \overline{AIC}_G · \overline{|G|} ) / ( |G_p| + \overline{|G|} )    (1)

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), \overline{AIC}_G is the average inverse-cardinality for all predicate–object pairs (formally, \overline{AIC}_G = |G| / |{(p,o) | ∃s : (s,p,o) ∈ G}|), and \overline{|G|} is the average number of distinct objects for all predicates in the graph (formally, \overline{|G|} = |{(p,o) | ∃s : (s,p,o) ∈ G}| / |{p | ∃s, ∃o : (s,p,o) ∈ G}|).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that AIC_G(p) · |G_p| = |{(s,o) | (s,p,o) ∈ G}| also denotes the number of triples for which p appears as a predicate in graph G, and that \overline{AIC}_G · \overline{|G|} = |G| / |{p | ∃s, ∃o : (s,p,o) ∈ G}| also denotes the average number of triples per predicate.

[Fig. 8: Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]). Nodes include swperson:stefan-decker, swperson:andreas-harth, swpaper:140, swpaper:2221, swpaper:3403, sworg:nui-galway, sworg:deri-nui-galway and dbpedia:Ireland; edge labels include foaf:made, dc:creator, swrc:author, foaf:maker, swrc:affiliation, foaf:member and foaf:based_near.]


We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (\overline{AIC}_G) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (\overline{|G|}).
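As a minimal sketch of Definitions 3–4 (not the on-disk implementation described later), the following computes inverse-cardinalities, AIC and the credibility-weighted AAIC of Eq. (1) over the small running example, assuming an in-memory list of triples; variable names are illustrative.

```python
# Minimal sketch of Definitions 3-4: observed inverse-cardinalities, their
# per-property average (AIC), and the credibility-weighted adjustment (AAIC, Eq. (1)).
from collections import defaultdict

G = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
]

icard = defaultdict(set)                 # (p, o) -> set of subjects
for s, p, o in G:
    icard[(p, o)].add(s)

def aic(p):                              # Definition 3
    groups = [subs for (q, _o), subs in icard.items() if q == p]
    return sum(len(subs) for subs in groups) / len(groups)

mean_aic = len(G) / len(icard)                              # mean AIC over all (p, o) pairs
mean_size = len(icard) / len({p for (p, _o) in icard})      # mean |G_p| over all predicates

def aaic(p):                             # Definition 4, Eq. (1)
    g_p = sum(1 for (q, _o) in icard if q == p)             # distinct objects of p
    return (aic(p) * g_p + mean_aic * mean_size) / (g_p + mean_size)

print(aic("foaf:gender"))                # 1.5, as in the worked example above
print(round(aaic("foaf:gender"), 3))     # the reading dampened towards the global mean
```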

Now we move towards combining these metrics to determine the concurrence of entities that share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company—e.g., http://google.com—should not gain as much concurrence as entities that share a value referential to a smaller company—e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, for which we find many observations in our corpus, returning a high AAIC value, and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is.

CGethp sTHORN frac141

CardGethp sTHORN AACGethpTHORN

and the concurrence-coefficient of a predicatendashobject pair (po) withrespect to a graph G is analogously given as

ICGethp oTHORN frac141

ICardGethp oTHORN AAICGethpTHORN

Again, please note that these coefficients fall into the interval ]0,1], since the denominator, by definition, is necessarily greater than one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2,000 times as a value for p_wh, i.e., ICard_G(p_wh, o_g) = 2000; now, IC_G(p_wh, o_g) = 1/(2000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie such that ICard_G(p_wh, o_d) = 100; now, IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that for each i = 1..n, z_i ∈ ]0,1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1..n, ACS_k = z_k + ACS_{k−1} − z_k · ACS_{k−1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring—P(A ∨ B) = P(A) + P(B) − P(A)·P(B)—which is by definition commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
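A minimal sketch of Definition 6 is given below; the coefficient values in the usage line are illustrative.

```python
# Minimal sketch of Definition 6: coefficients in ]0,1] are combined with the
# probabilistic sum (the "independent events" formula), so order does not matter
# and each extra shared pair can only increase the score.
def acs(coefficients):
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

print(acs([0.00143, 0.00143, 0.00007]))   # three weak indicators combined
```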

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);
(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);
(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples \overline{Z} = {Z_{p_1}, ..., Z_{p_n}}, where each element Z_{p_i} represents the tuple of (non-pruned) coefficients generated for the predicate p_i.29 We then aggregate this bag as follows:

  ACS(\overline{Z}) = ACS( ( ACS(Z_{p_i} · AA[I]C(p_i)) / AA[I]C(p_i) )_{Z_{p_i} ∈ \overline{Z}} )

where AA[I]C is either AAC or AAIC, depending on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.
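The sketch below follows the reconstruction of the formula above and is illustrative only: per-predicate coefficients are rescaled by the predicate's AA[I]C value, aggregated with the probabilistic sum, and the predicate's total contribution is then capped by dividing by the same AA[I]C value before the outer aggregation. The example weights and coefficients are hypothetical.

```python
# Sketch of the two-level aggregation for intra-property correlation.
from functools import reduce

def acs(zs):
    # probabilistic-sum aggregation of Definition 6
    return reduce(lambda a, z: z + a - z * a, zs, 0.0)

def aggregate_concurrence(per_predicate, aa_ic):
    """per_predicate: dict p -> list of shared-pair coefficients for p;
       aa_ic: dict p -> AA[I]C(p) (direction-appropriate adjusted average)."""
    contributions = []
    for p, coeffs in per_predicate.items():
        inner = acs([z * aa_ic[p] for z in coeffs])   # in ]0,1], grows with shared values
        contributions.append(inner / aa_ic[p])        # bounded by 1 / AA[I]C(p)
    return acs(contributions)

shared = {"foaf:made": [0.05, 0.05, 0.05], "swrc:affiliation": [0.002]}
weights = {"foaf:made": 2.0, "swrc:affiliation": 50.0}
print(round(aggregate_concurrence(shared, weights), 4))
```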

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory—if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s,p,o,c) and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o,p,s,c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
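A minimal sketch of the inverse-cardinality scan is given below, assuming an in-memory list of quads sorted by (o,p,s,c); the per-scan credibility adjustment of Eq. (1) is omitted, and the names and sample data are illustrative.

```python
# Minimal sketch of the statistics scan: with quads sorted by (o, p, s, c), all
# subjects for a given (p, o) pair are contiguous, so one pass suffices to count
# inverse-cardinalities and average them per predicate.
from itertools import groupby
from collections import defaultdict

def inverse_cardinality_stats(quads_sorted_by_opsc):
    totals = defaultdict(int)     # p -> sum of inverse-cardinalities
    groups = defaultdict(int)     # p -> number of distinct (p, o) pairs
    key = lambda q: (q[2], q[1])  # group by (o, p), contiguous under (o,p,s,c) order
    for (o, p), group in groupby(quads_sorted_by_opsc, key=key):
        subjects = {q[0] for q in group}
        totals[p] += len(subjects)
        groups[p] += 1
    return {p: totals[p] / groups[p] for p in totals}   # AIC_G(p) per predicate

quads = sorted(
    [("ex:Alice", "foaf:gender", '"female"', "doc1"),
     ("ex:Claire", "foaf:gender", '"female"', "doc2"),
     ("ex:Bob", "foaf:gender", '"male"', "doc1")],
    key=lambda q: (q[2], q[1], q[0], q[3]))
print(inverse_cardinality_stats(quads))   # {'foaf:gender': 1.5}
```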

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order and, for each (s, p) pair, for each set of unique objects $O_{ps}$ found thereon, and for each pair in

$\{(o_i, o_j) \in (U \cup B) \times (U \cup B) \mid o_i, o_j \in O_{ps}, o_i < o_j\}$,

where $<$ denotes lexicographical order, we output the following sextuple to an on-disk file:

$(o_i, o_j, C(p, s), p, s, -)$,

where $C(p, s) = \frac{1}{|O_{ps}| \cdot AAC(p)}$. We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects $S_{po}$, and for each pair in

$\{(s_i, s_j) \in (U \cup B) \times (U \cup B) \mid s_i, s_j \in S_{po}, s_i < s_j\}$,

we output analogous sextuples of the form

$(s_i, s_j, IC(p, o), p, o, +)$.

We call the sets $O_{ps}$ and their analogues $S_{po}$ concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.

29 For brevity, we omit the graph subscript.
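As an illustration of the sextuple generation, the following hedged sketch emits the raw tuples for a single concurrence class O_ps; the aac statistics map, the maximum class size (discussed below) and the output file handle are assumed to be provided by the earlier statistics scan, and the objects are assumed to already be restricted to URIs and blank nodes.

from itertools import combinations

def emit_po_concurrences(p, s, objects, aac, max_class_size, out):
    # Emits (o_i, o_j, C(p, s), p, s, '-') for each pair in the concurrence
    # class O_ps, where C(p, s) = 1 / (|O_ps| * AAC(p)); classes above the
    # size threshold are skipped entirely.
    if len(objects) < 2 or len(objects) > max_class_size:
        return
    score = 1.0 / (len(objects) * aac[p])
    for o_i, o_j in combinations(sorted(objects), 2):  # o_i < o_j lexicographically
        out.write(f"{o_i}\t{o_j}\t{score}\t{p}\t{s}\t-\n")

The analogous routine for the shared (p, o) direction uses AAIC(p) and the '+' marker.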

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements: generalising the directional $o_i$/$s_i$ to simply $e_i$, each $(e_i, e_j)$ pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each $(e_i, e_j)$ pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form $(e_i, e_j, s)$, where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples $(e_j, e_i, s)$, which can be sorted separately as required.
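The aggregation step can be sketched as follows; note that the combination function below is only a probabilistic-sum placeholder standing in for the ACS aggregation defined earlier (an assumption made for illustration, not the paper's exact formula), and the per-predicate upper bounds are likewise omitted.

from itertools import groupby

def combine(scores):
    # Placeholder aggregation: each additional evidence increases the result,
    # which never exceeds 1; stands in for the ACS function defined earlier.
    total = 0.0
    for x in scores:
        total = total + x - total * x
    return total

def aggregate_concurrence(sorted_sextuples):
    # Input: merged stream of raw sextuples (e_i, e_j, score, p, v, direction)
    # sorted by the entity pair; output: final (e_i, e_j, s) concurrence tuples.
    for (e_i, e_j), group in groupby(sorted_sextuples, key=lambda t: (t[0], t[1])):
        per_pred = {}
        for _, _, score, p, _, d in group:
            per_pred.setdefault((d, p), []).append(score)   # intra-property grouping
        yield e_i, e_j, combine(combine(scores) for scores in per_pred.values())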

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets. For example, consider a corpus containing 1 million persons sharing the value female for the property foaf:gender, where we would have to generate approximately $\frac{10^6 \cdot 10^6}{2} = 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., $|S_{po}|$) is a factor in the denominator of the computed C(p, o) score, such that $C(p, o) \leq \frac{1}{|S_{po}|}$. Thus, in practice, we implement a maximum-size threshold for the $S_{po}$ and $O_{ps}$ concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4. A sketch of how such a cut-off can be derived from the class-size distribution follows.
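Assuming the class-size distribution has been gathered as a histogram during the statistics scan, the cut-off can be chosen with a loop of the following form (a hedged sketch; the histogram and budget names are ours):

def choose_max_class_size(class_size_histogram, tuple_budget):
    # class_size_histogram maps a class size n to the number of classes of that
    # size; each such class yields n*(n-1)/2 non-reflexive, non-symmetric pairs.
    # Returns the largest cut-off whose cumulative pair count stays within budget.
    cumulative = 0
    cutoff = 1
    for size in sorted(class_size_histogram):
        pairs = class_size_histogram[size] * size * (size - 1) // 2
        if cumulative + pairs > tuple_budget:
            break
        cumulative += pairs
        cutoff = size
    return cutoff

Called with a budget of 10**9, this mirrors the one-billion-tuple limit used above.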

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12
Breakdown of timing of distributed concurrence analysis.

Category                                    1             2             4             8             Full-8
                                            min    %      min    %      min    %      min    %      min    %
Master
  Total execution time                      516.2  100.0  262.7  100.0  137.8  100.0  73.0   100.0  835.4  100.0
  Executing                                 2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
    Miscellaneous                           2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
  Idle (waiting for slaves)                 514.0  99.6   260.4  99.1   134.6  97.7   69.6   95.4   833.3  99.7
Slave
  Avg. executing (total exc. idle)          514.0  99.6   257.8  98.1   131.1  95.1   66.4   91.0   786.6  94.2
    Split/sort/scatter (subject)            119.6  23.2   59.3   22.6   29.9   21.7   14.9   20.4   175.9  21.1
    Merge-sort (subject)                    42.1   8.2    20.7   7.9    11.2   8.1    5.7    7.9    62.4   7.5
    Split/sort/scatter (object)             115.4  22.4   56.5   21.5   28.8   20.9   14.3   19.6   164.0  19.6
    Merge-sort (object)                     37.2   7.2    18.7   7.1    9.6    7.0    5.1    7.0    55.6   6.6
    Extract high-level statistics           20.8   4.0    10.1   3.8    5.1    3.7    2.7    3.7    28.4   3.3
    Generate raw concurrence tuples         45.4   8.8    23.3   8.9    11.6   8.4    5.9    8.0    67.7   8.1
    Coordinate/sort concurrence tuples      97.6   18.9   50.1   19.1   25.0   18.1   12.5   17.2   166.0  19.9
    Merge-sort/aggregate similarity         36.0   7.0    19.2   7.3    10.0   7.2    5.3    7.2    66.6   8.0
  Avg. idle                                 2.2    0.4    4.9    1.9    6.7    4.9    6.5    9.0    48.8   5.8
    Waiting for peers                       0.0    0.0    2.6    1.0    3.6    2.6    3.2    4.3    46.7   5.6
    Waiting for master                      2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values would lead to an uneven distribution of data, and subsequently uneven load balancing. For example, 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.
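A hedged sketch of the coordinate step follows (the actual system scatters sorted on-disk segments over sockets, which is abstracted here behind a send callback); note that rdf:type triples are excluded when hashing on the object position, as just discussed.

def coordinate_by_position(quads, position, num_slaves, send,
                           rdf_type="http://www.w3.org/1999/02/22-rdf-syntax-ns#type"):
    # Splits a local segment of quads (s, p, o, c) by a modulo-hash on the term
    # at the given position (0 = subject, 2 = object) and hands each quad to
    # send(peer, quad); each peer then sorts and merge-sorts what it receives.
    for quad in quads:
        s, p, o, c = quad
        if position == 2 and p == rdf_type:
            continue  # avoid skew from popular classes such as foaf:Person
        send(hash(quad[position]) % num_slaves, quad)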

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% of its time spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power–law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static.

[Fig. 9: (a) Predicate–object pairs; (b) Predicate–subject pairs. Log/log plots of count versus concurrence class size, showing the number of classes, the concurrence output size, the cumulative concurrence output size, and the cut-off.]

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs respectively, where for each class size on the x-axis ($s_i$), we show the number of classes ($c_i$), the raw non-reflexive, non-symmetric concurrence tuples generated ($sp_i = c_i \frac{s_i^2 - s_i}{2}$), a cumulative count of tuples generated for increasing class sizes ($csp_i = \sum_{j \leq i} sp_j$), and our cut-off ($s_i$ = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

    Predicate              Objects        Triples        AAC
1   foaf:nick              150,433,035    150,437,864    1.000
2   lldpubmed:journal      6,790,426      6,795,285      1.003
3   rdf:value              2,802,998      2,803,021      1.005
4   eurostat:geo           2,642,034      2,642,034      1.005
5   eurostat:time          2,641,532      2,641,532      1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

    Predicate                    Subjects     Triples      AAIC
1   lldpubmed:meshHeading        2,121,384    2,121,384    1.009
2   opiumfield:recommendation    1,920,992    1,920,992    1.010
3   fb:type.object.key           1,108,187    1,108,187    1.017
4   foaf:page                    1,702,704    1,712,970    1.017
5   skipinions:hasFeature        1,010,604    1,010,604    1.019


That is, larger class sizes give quadratically more concurrences, but occur polynomially less often, until the point where the largest classes, which generally only have one occurrence, are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power–law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise", where for milestone concurrence-class sizes (esp. at 50, 100, 1000, 2000) we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996); the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs.

Table 15
Top five concurrent entities and the number of pairs they share.

    Entity label 1     Entity label 2     Concur.
1   New York City      New York State     791
2   London             England            894
3   Tokyo              Japan              900
4   Toronto            Ontario            418
5   Philadelphia       Pennsylvania       217

Table 16
Breakdown of concurrences for top five ranked entities in SWSE, ordered by rank, with, respectively, entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

    Ranked entity        Con.    "Closest" entity    Val.
1   Tim Berners-Lee      908     Lalana Kagal        0.83
2   Dan Brickley         2552    Libby Miller        0.94
3   updatestatus.net     11      socialnetwork.ro    0.45
4   FOAF-a-matic         21      foaf.me             0.23
5   Evan Prodromou       3367    Stav Prodromou      0.89


That is, inter-PLD concurrences are not only fewer in number, but also typically much weaker than intra-PLD concurrences.32

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities; for brevity, again we show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks, exported from the identi.ca service.

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading; for the purposes of illustration, we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section), to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not, in the previous consolidation step we output sextuples of the form

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values, and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data (see also the sketch after the example):

Terminological [http://xmlns.com/foaf/spec/index.rdf]
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
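A minimal sketch of the relaxed comparison just described is given below; literals are modelled here as (lexical form, language tag, datatype) tuples, which is an assumption of the sketch rather than the system's internal representation.

def literals_incompatible(l1, l2):
    # Flag an inconsistency for a functional datatype-property only if the two
    # literal bindings differ AND their lower-cased lexical forms (ignoring
    # language tags and datatypes) also differ.
    if l1 == l2:
        return False
    return l1[0].strip().lower() != l2[0].strip().lower()

# e.g., ("Male", "en", None) vs ("male", None, None)                  -> False (not flagged)
#       ("3.546", None, "xsd:double") vs ("2.97", None, "xsd:double") -> True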

With respect to these consistency-checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw unconsolidated corpus.

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
made owl:propertyDisjointWith maker .

Assertional [fictional]
ex:AZ′′ maker ex:entcons′ .
dblp:Antoine_Zimmermann′′ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting-set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph $G = (V, E, \omega)$, such that $V \subset \mathbf{B} \cup \mathbf{U}$ is the set of vertices, $E \subseteq (\mathbf{B} \cup \mathbf{U}) \times (\mathbf{B} \cup \mathbf{U})$ is the set of edges, and $\omega : E \to \mathbb{N} \times \mathbb{R}$ is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation, by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class $Eq \subset \mathbf{U} \cup \mathbf{B}$ that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets $C = \{C_1, \ldots, C_n\}$, $C \subset 2^{\mathbf{U} \cup \mathbf{B}}$, such that each $C_i \in C$, $C_i \subset Eq$, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets $Eq^0$, each containing an individual identifier in the equivalence class. Now, let $U(E_i, E_j)$ denote the aggregated weight of the edge considering the merge of the nodes of $E_i$ and the nodes of $E_j$ in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in $E_i$ and all nodes in $E_j$, and such that c denotes the concurrence score considering the merge of entities in $E_i$ and $E_j$; intuitively, the same weight as before, but applied as if the identifiers in $E_i$ and $E_j$ were consolidated in the graph. We can apply the following clustering:

– for each pair of sets $E_i, E_j \in Eq^n$ such that $\nexists \{a, b\} \in C$ with $a \in E_i, b \in E_j$ (i.e., consistently mergeable subsets), identify the weights $U(E_i, E_j)$ and order the pairings;

– in descending order with respect to the above weights, merge $E_i, E_j$ pairs (such that neither $E_i$ nor $E_j$ have already been merged in this iteration), producing $Eq^{n+1}$ at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence classes by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
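The clustering can be sketched as follows. This is a simplified, hedged illustration: evidence counts are summed and concurrence scores combined by maximum, which merely stand in for the aggregated weight U(E_i, E_j) described above, but the overall structure (merge the most strongly linked, consistently mergeable subsets and iterate to a fixpoint) is as described.

def repair_equivalence_class(identifiers, evidence, concurrence, incompatible):
    # evidence[(a, b)]    : number of independent evidences for the equivalence a ~ b
    # concurrence[(a, b)] : concurrence score between a and b (used to break ties)
    # incompatible        : set of frozensets {a, b} that must end up separated
    clusters = [{i} for i in identifiers]             # Eq^0: singleton seed sets

    def mergeable(c1, c2):
        return not any(frozenset((a, b)) in incompatible for a in c1 for b in c2)

    def weight(c1, c2):
        d = sum(evidence.get((a, b), 0) + evidence.get((b, a), 0)
                for a in c1 for b in c2)
        c = max((max(concurrence.get((a, b), 0.0), concurrence.get((b, a), 0.0))
                 for a in c1 for b in c2), default=0.0)
        return (d, c)                                  # compared lexicographically

    changed = True
    while changed:                                     # iterate until fixpoint
        changed = False
        candidates = sorted(((weight(c1, c2), i, j)
                             for i, c1 in enumerate(clusters)
                             for j, c2 in enumerate(clusters)
                             if i < j and mergeable(c1, c2)), reverse=True)
        merged = set()
        for w, i, j in candidates:
            if w <= (0, 0.0) or i in merged or j in merged:
                continue                               # unlinked, or already merged this round
            clusters[i] |= clusters[j]
            clusters[j] = set()
            merged.update((i, j))
            changed = True
        clusters = [c for c in clusters if c]
    return clusters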

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.
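For illustration, the macro-level batching can be sketched as below (a hedged sketch assuming two iterators of sextuples, each already sorted on its canonical join position in element 0); each batch gathers the inlinks and outlinks of one consolidated entity for the micro-level repair.

from heapq import merge
from itertools import groupby

def batches_by_canonical_id(by_subject, by_object):
    # by_subject yields (s, p, o, c, s', o') sorted by canonical subject s;
    # by_object yields (o, p, s, c, o', s') sorted by canonical object o.
    tagged = merge(((t[0], "out", t) for t in by_subject),
                   ((t[0], "in", t) for t in by_object))
    for canonical_id, group in groupby(tagged, key=lambda x: x[0]):
        outlinks, inlinks = [], []
        for _, direction, sextuple in group:
            (outlinks if direction == "out" else inlinks).append(sextuple)
        yield canonical_id, outlinks, inlinks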

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

Category                                    1             2             4             8             Full-8
                                            min    %      min    %      min    %      min    %      min    %
Master
  Total execution time                      114.7  100.0  61.3   100.0  36.7   100.0  23.8   100.0  141.1  100.0
  Executing                                 0.0    0.0    0.1    0.1    0.1    0.1    0.0    0.0    0.0    0.0
    Miscellaneous                           0.0    0.0    0.1    0.1    0.1    0.1    0.0    0.0    0.0    0.0
  Idle (waiting for slaves)                 114.7  100.0  61.2   100.0  36.6   99.9   23.8   99.9   141.1  100.0
Slave
  Avg. executing (total exc. idle)          114.7  100.0  59.0   96.3   33.6   91.5   22.3   93.7   130.9  92.8
    Identify inconsistencies and repairs    74.7   65.1   38.5   62.8   23.2   63.4   17.2   72.2   72.4   51.3
    Repair corpus                           40.0   34.8   20.5   33.5   10.3   28.1   5.1    21.5   58.5   41.5
  Avg. idle                                 0.0    0.0    2.3    3.7    3.1    8.5    1.5    6.3    10.2   7.2
    Waiting for peers                       0.0    0.0    2.2    3.7    3.1    8.4    1.5    6.2    10.1   7.2
    Waiting for master                      0.0    0.0    0.1    0.1    0.0    0.0    0.0    0.0    0.1    0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%); in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards to the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent on the master machine. Most notably, load balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards to the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection.

Manual       Auto         Count   %
SAME         /            223     44.3
DIFFERENT    /            261     51.9
UNCLEAR      –            19      3.8
/            SAME         361     74.6
/            DIFFERENT    123     25.4
SAME         SAME         205     91.9
SAME         DIFFERENT    18      8.1
DIFFERENT    DIFFERENT    105     40.2
DIFFERENT    SAME         156     59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

    Functional property                   Detections
1   foaf:gender                           60
2   foaf:age                              32
3   dbo:height, dbo:length                4
4   dbo:wheelbase, dbo:width              3
5   atomowl:body                          1
6   loc:address, loc:name, loc:phone      1

Table 19
Breakdown of inconsistency detections for disjoint classes.

    Disjoint class 1       Disjoint class 2    Detections
1   foaf:Document          foaf:Person         163
2   foaf:Document          foaf:Agent          153
3   foaf:Organization      foaf:Person         17
4   foaf:Person            foaf:Project        3
5   foaf:Organization      foaf:Project        1

[Fig. 10: log/log plot of the number of repaired equivalence classes versus the number of partitions.]

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

[Fig. 11: log/log plot of the number of partitions versus the number of IDs in the partition.]

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).

44 We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.


From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that, between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 223 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 261 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded; the results are enumerated in Table 20. Of the 484 (clear) manually-inspected pairs, 361 (74.6%) remained equivalent after the repair, and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.

Interpreting and inspecting the results, we found that the


inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log(n)); external sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity-concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in thearea of databases for a number of years aiming to identify and pro-cess co-referent signifiers with works under the titles of recordlinkage record or data fusion merge-purge instance fusion andduplicate identification and (ironically) a plethora of variationsthereupon see [474413512] etc and surveys at [198] Unlikeour approach which leverages the declarative semantics of thedata in order to be domain agnostic such systems usually operategiven closed schemasmdashsimilarly they typically focus on string-similarity measures and statistical analysis Haas et al note thatlsquolsquoin relational systems where the data model does not provide primi-tives for making same-as assertions lt gt there is a value-based notionof identityrsquorsquo [25] However we note that some works have focusedon leveraging semantics for such tasks in relation databases egFan et al [21] leverage domain knowledge to match entities whereinterestingly they state lsquolsquoreal life data is typically dirty ltthusgt it isoften necessary to hinge on the semantics of the datarsquorsquo

Some other works – more related to Information Retrieval and Natural Language Processing – focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets – i.e., consistently named datasets –

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org
49 Instance data matching: http://www.instancematching.org

across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology – particularly functional properties, inverse-functional properties and disjointness constraints – which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL – we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges – esp. scalability and noise – of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system – called ObjectCoRef – for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 1.2 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset, or by their corpora) and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties – i.e., those for which a typical value is shared by few instances – are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs
54 http://sameas.org
55 http://ws.nju.edu.cn/objectcoref
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes – e.g., to notify another agent or correct broken links – and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs – particularly substitution – does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence-class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper – particularly as applied to large scale Linked Data corpora – is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix          URI

"T-Box prefixes"
atomowl         http://bblfish.net/work/atom-owl/2006-06-06/
b2r             http://bio2rdf.org/bio2rdf:
b2rr            http://bio2rdf.org/bio2rdf_resource:
dbo             http://dbpedia.org/ontology/
dbp             http://dbpedia.org/property/
ecs             http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat        http://ontologycentral.com/2009/01/eurostat/ns#
fb              http://rdf.freebase.com/ns/
foaf            http://xmlns.com/foaf/0.1/
geonames        http://www.geonames.org/ontology#
kwa             http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed       http://linkedlifedata.com/resource/pubmed/
loc             http://sw.deri.org/2006/07/location/loc#
mo              http://purl.org/ontology/mo/
mvcb            http://webns.net/mvcb#
opiumfield      http://rdf.opiumfield.com/lastfm/spec#
owl             http://www.w3.org/2002/07/owl#
quaffing        http://purl.org/net/schemas/quaffing/
rdf             http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs            http://www.w3.org/2000/01/rdf-schema#
skipinions      http://skipforward.net/skipforward/page/seeder/skipinions/
skos            http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl         http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia         http://dbpedia.org/resource/
dblpperson      http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p       http://www.eswc2006.org/people/
identicau       http://identi.ca/user/
kingdoms        http://lod.geospecies.org/kingdoms/
macs            http://stitch.cs.vu.nl/alignments/macs/
semweborg       http://data.semanticweb.org/organization/
swid            http://semanticweb.org/id/
timblfoaf       http://www.w3.org/People/Berners-Lee/card#
vperson         http://virtuoso.openlinksw.com/dataspace/person/
wikier          http://www.wikier.org/foaf.rdf#
yagor           http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission, 2008. http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial, 2008. Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, volume 201 of CEUR Workshop Proceedings, 2006.
[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop, 2008.
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009, 2009.
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft, 2008. http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1), 2010, pp. 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web WWW2010 Workshop (LDOW2010), 2010.
[28] Patrick Hayes, RDF Semantics, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT, 2007, pp. 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW, 2010, pp. 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), 2009.
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop, 2008.
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC, 2010.
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference, 2011.
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR, 2007, pp. 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009), 2009.
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference, 2009, pp. 634–649.
[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


The master machine provides input data to the slave swarm, provides the control logic required by the distributed task (commencing tasks, coordinating timing, ending tasks), gathers and locally performs tasks on global knowledge that the slave machines would otherwise have to replicate in parallel, and transmits globally required knowledge.

The slave machines, as well as performing tasks in parallel, can perform the following distributed operation (at the behest of the master machine):

– Coordinate: local data on each slave machine are partitioned according to some split function, with the chunks sent to individual machines in parallel; each slave machine also gathers the incoming chunks in parallel using some merge function.

The above operation allows slave machines to reorganise (split/send/gather) intermediary data amongst themselves; the coordinate operation could be replaced by a pair of gather/scatter operations performed by the master machine, but we wish to avoid the channelling of all intermediary data through one machine.
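To make this concrete, the following minimal Java sketch (illustrative only: the class, method and example URI below are ours and not part of the system described in this paper) shows one way a split function might assign a quadruple to one of n peers by hashing on its subject term:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative split function: assigns a quad to one of numSlaves machines by hashing its subject.
public class SubjectHashSplitter {
    private final int numSlaves;

    public SubjectHashSplitter(int numSlaves) {
        this.numSlaves = numSlaves;
    }

    // Returns the index of the slave machine responsible for the given subject term.
    public int targetSlave(String subject) {
        CRC32 crc = new CRC32();
        crc.update(subject.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numSlaves);
    }

    public static void main(String[] args) {
        SubjectHashSplitter splitter = new SubjectHashSplitter(8);
        System.out.println(splitter.targetSlave("http://example.org/entity/A"));
    }
}

Each slave would apply the same function to every quad in its local chunk, stream the quad to the responsible peer, and merge the chunks it receives in parallel.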

Note that, herein, we assume that the input corpus is evenly distributed and split across the slave machines, and that the slave machines have roughly even specifications: that is, we do not consider any special form of load balancing, but instead aim to have uniform machines processing comparable data-chunks.

We note that there is the practical issue of the master machine being idle, waiting for the slaves, and, more critically, the potentially large cluster of slave machines waiting idle for the master machine. One could overcome idle times with mature task-scheduling (e.g., interleaving jobs) and load-balancing. From an algorithmic point of view, removing the central coordination on the master machine may enable better distributability. One possibility would be to allow the slave machines to duplicate the aggregation of global knowledge in parallel; although this would free up the master machine and would probably take roughly the same time, duplicating computation wastes resources that could otherwise be exploited by, e.g., interleaving jobs. A second possibility would be to avoid the requirement for global knowledge and to coordinate upon the larger corpus (e.g., a coordinate function hashing on the subject and object of the data, or perhaps an adaptation of the SpeedDate routing strategy [51]). Such decisions are heavily influenced by the scale of the task to perform, the percentage of knowledge that is globally required, how the input data are distributed, how the output data should be distributed, the nature of the cluster over which the task should be performed, and the task-scheduling possible. The distributed implementations of our tasks are designed to exploit a relatively small percentage of global knowledge, which is cheap to coordinate, and we choose to avoid – insofar as reasonable – duplicating computation.

2.8. Experimental setup

Our entire code-base is implemented on top of standard Java libraries. We instantiate the distributed architecture using Java RMI libraries, and use the lightweight open-source Java RMIIO package9 for streaming data over the network.

All of our evaluation is based on nine machines connected by Gigabit ethernet,10 each with uniform specifications, viz.: 2.2 GHz Opteron x86-64, 4 GB main memory, 160 GB SATA hard-disks, running Java 1.6.0_12 on Debian 5.0.4.

9 http://openhms.sourceforge.net/rmiio
10 We observe, e.g., a max FTP transfer rate of 38 MB/s between machines.
11 Please see [32, Appendix A] for further statistics relating to this corpus.

3. Experimental corpus

Later in this paper, we discuss the performance and results of applying our methods over a corpus of 1.118 billion quadruples derived from an open-domain RDF/XML crawl of 3.985 million web documents in mid-May 2010 (detailed in [32,29]). The crawl was conducted in a breadth-first manner, extracting URIs from all positions of the RDF data. Individual URI queues were assigned to different pay-level-domains (aka. PLDs: domains that require payment, e.g., deri.ie, data.gov.uk), where we enforced a politeness policy of accessing a maximum of two URIs per PLD per second. URIs with the highest inlink count per each PLD queue were polled first.

With regards the resulting corpus, of the 1.118 billion quads, 1.106 billion are unique and 947 million are unique triples. The data contain 23 thousand unique predicates and 105 thousand unique class terms (terms in the object position of an rdf:type triple). In terms of diversity, the corpus consists of RDF from 783 pay-level-domains [32].

Note that further details about the parameters of the crawl and statistics about the corpus are available in [32,29].

Now, we discuss the usage of terms in a data-level position, viz., terms in the subject position or object position of non-rdf:type triples.11 Since we do not consider the consolidation of literals or schema-level concepts, we focus on characterising blank-node and URI re-use in such data-level positions, thus rendering a picture of the structure of the raw data.

We found 286.3 million unique terms, of which 165.4 million (57.8%) were blank-nodes, 92.1 million (32.2%) were URIs, and 28.9 million (10%) were literals. With respect to literals, each had on average 9.473 data-level occurrences (by definition, all in the object position).

With respect to blank-nodes, each had on average 5.233 data-level occurrences. Each occurred on average 0.995 times in the object position of a non-rdf:type triple, with 3.1 million (1.87%) not occurring in the object position; conversely, each occurred on average 4.239 times in the subject position of a triple, with 69 thousand (0.04%) not occurring in the subject position. Thus, we summarise that almost all blank-nodes appear in both the subject position and object position, but occur most prevalently in the former. Importantly, note that in our input, blank-nodes cannot be re-used across sources.

With respect to URIs, each had on average 9.41 data-level occurrences (1.8× the average for blank-nodes), with 4.399 average appearances in the subject position and 5.01 appearances in the object position – 19.85 million (21.55%) did not appear in an object position, whilst 57.91 million (62.88%) did not appear in a subject position.

With respect to re-use across sources, each URI had a data-level occurrence in, on average, 4.7 documents and 1.008 PLDs – 56.2 million (61.02%) of URIs appeared in only one document, and 91.3 million (99.13%) only appeared in one PLD. Also, re-use of URIs across documents was heavily weighted in favour of use in the object position: URIs appeared in the subject position in, on average, 1.061 documents and 0.346 PLDs; for the object position of non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca/; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident – again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8× more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8× more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position) [log/log scale: x-axis, number of documents mentioning URI; y-axis, number of URIs].

Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position) [log/log scale: x-axis, number of PLDs mentioning URI; y-axis, number of URIs].

– each URI is re-used on average in 4.7 documents, but usually only within the same domain – most external re-use is in the object position of a triple;

– 99% of URIs appear in only one PLD.

We can conclude that, within our corpus – itself a general crawl for RDF/XML on the Web – we find that there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best-practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource" [7].

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus – provided directly by the Linked Data publishers themselves – that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers, where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps identifiers to the equivalence class in which they are contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier for each equivalence class: we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;
(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;
(iii) for each equivalence class in the index, choose a canonical term;
(iv) scan the corpus again, canonicalising any term in the subject position or object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus – owl:sameAs statements – and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;
(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);
(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing a term with the lowest alphabetical ordering; a canonical term is thus associated to each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
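As a rough illustration of this canonicalisation scan (a simplification of ours, not the system's actual code), the following Java sketch rewrites a single quad – represented naively as a String array of subject, predicate, object and context – against a map canon from each identifier to its canonical term; identifiers outside any equivalence class are left untouched:

import java.util.Map;

// Illustrative canonicalisation of one quad against the equivalence-class index.
public class Canonicaliser {
    static final String RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

    // Rewrites the subject, and the object of non-rdf:type statements, to canonical form.
    static String[] canonicalise(String[] quad, Map<String, String> canon) {
        String s = canon.getOrDefault(quad[0], quad[0]);
        String o = RDF_TYPE.equals(quad[1]) ? quad[2] : canon.getOrDefault(quad[2], quad[2]);
        return new String[]{s, quad[1], o, quad[3]};
    }
}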

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations;
(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index;
(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods – involving the two scans of the main corpus – can be conducted in parallel.
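The following Java sketch outlines this run/gather/flood pattern; the Slave interface, EqualityIndexBuilder interface and Master class are purely illustrative assumptions on our part (the actual system uses Java RMI, cf. Section 2.8), and the two loops over the slaves would, in practice, be executed in parallel:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative control flow for distributed base-line consolidation.
interface Slave {
    // Run/gather: scan the local chunk and return the owl:sameAs statements found there.
    List<String[]> extractSameAs();

    // Flood/run: receive the global canonical-term index and rewrite the local chunk.
    void consolidate(Map<String, String> canonicalIndex);
}

interface EqualityIndexBuilder {
    // Builds the canonical-term index from the gathered owl:sameAs statements (cf. Algorithm 1).
    Map<String, String> build(List<String[]> sameAs);
}

class Master {
    void run(List<Slave> slaves, EqualityIndexBuilder builder) {
        List<String[]> sameAs = new ArrayList<>();
        for (Slave s : slaves) sameAs.addAll(s.extractSameAs()); // gather owl:sameAs data
        Map<String, String> index = builder.build(sameAs);       // build the equality index locally
        for (Slave s : slaves) s.consolidate(index);              // flood the index to all slaves
    }
}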

Algorithm 1

Building equivalence map [32]

Require: SAMEAS DATA: SA
1:  map ← ∅
2:  for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
3:    eqc_s ← map.get(s)
4:    eqc_o ← map.get(o)
5:    if eqc_s = ∅ ∧ eqc_o = ∅ then
6:      eqc_{s∪o} ← {s, o}
7:      map.put(s, eqc_{s∪o})
8:      map.put(o, eqc_{s∪o})
9:    else if eqc_s = ∅ then
10:     add s to eqc_o
11:     map.put(s, eqc_o)
12:   else if eqc_o = ∅ then
13:     add o to eqc_s
14:     map.put(o, eqc_s)
15:   else if eqc_s ≠ eqc_o then
16:     if |eqc_s| > |eqc_o| then
17:       add all eqc_o into eqc_s
18:       for e_o ∈ eqc_o do
19:         map.put(e_o, eqc_s)
20:       end for
21:     else
22:       add all eqc_s into eqc_o
23:       for e_s ∈ eqc_s do
24:         map.put(e_s, eqc_o)
25:       end for
26:     end if
27:   end if
28: end for
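For illustration, the following self-contained Java sketch mirrors Algorithm 1 – union-find with a union-by-size step – and also derives the canonical-term map described in Section 4.1; it is a simplification on our part (assuming, e.g., that blank-nodes are written with a "_:" prefix), not the system's actual implementation:

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of Algorithm 1: union-find over owl:sameAs pairs, plus canonical-term selection.
public class EquivalenceIndex {
    private final Map<String, Set<String>> map = new HashMap<>(); // term -> its equivalence class

    // Add one owl:sameAs pair (s, o).
    public void add(String s, String o) {
        if (s.equals(o)) return;
        Set<String> eqS = map.get(s), eqO = map.get(o);
        if (eqS == null && eqO == null) {
            Set<String> eq = new HashSet<>(Arrays.asList(s, o));
            map.put(s, eq);
            map.put(o, eq);
        } else if (eqS == null) {
            eqO.add(s);
            map.put(s, eqO);
        } else if (eqO == null) {
            eqS.add(o);
            map.put(o, eqS);
        } else if (eqS != eqO) {
            Set<String> big = eqS.size() >= eqO.size() ? eqS : eqO; // union-by-size
            Set<String> small = (big == eqS) ? eqO : eqS;
            big.addAll(small);
            for (String t : small) map.put(t, big);
        }
    }

    // Canonical term per class: prefer URIs over blank-nodes, then lowest lexical order.
    public Map<String, String> canonicalMap() {
        Map<String, String> canon = new HashMap<>();
        for (Set<String> eqc : new HashSet<>(map.values())) {
            String best = null;
            for (String t : eqc) {
                if (best == null || preferred(t, best)) best = t;
            }
            for (String t : eqc) canon.put(t, best);
        }
        return canon;
    }

    private static boolean preferred(String a, String b) {
        boolean aBlank = a.startsWith("_:"), bBlank = b.startsWith("_:");
        if (aBlank != bBlank) return !aBlank; // URIs beat blank-nodes
        return a.compareTo(b) < 0;            // then lowest alphabetical ordering
    }
}

A caller would invoke add for each extracted owl:sameAs pair and then canonicalMap once, before the second scan rewrites the corpus.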

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan extracting owl:sameAs statements took 12.5 min, with an average idle time for the servers of 11 s (1.4%) – i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan – rewriting the data according to the canonical identifiers – took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

Table 5 also gives performance results for one master and varying numbers of slave machines (1/2/4/8) over the 100 million quadruple corpus.

Table 5
Breakdown of timing of distributed baseline consolidation.

Category                        1             2             4             8             Full-8
                                min     %     min     %     min     %     min     %     min     %
Master
  Total execution time          42.2  100.0   22.3  100.0   11.9  100.0    6.7  100.0   63.3  100.0
  Executing                      0.5    1.1    0.5    2.2    0.5    4.0    0.5    7.5    8.5   13.4
    Aggregating owl:sameAs       0.4    1.1    0.5    2.2    0.5    4.0    0.5    7.5    8.4   13.3
    Miscellaneous                0.1    0.1    0.1    0.4    *      *      *      *      0.1    0.1
  Idle (waiting for slaves)     41.8   98.9   21.8   97.7   11.4   95.8    6.2   92.1   54.8   86.6
Slaves
  Avg. executing (total)        41.8   98.9   21.8   97.7   11.4   95.8    6.2   92.1   53.5   84.6
    Extract owl:sameAs           9.0   21.4    4.5   20.2    2.2   18.4    1.1   16.8   12.3   19.5
    Consolidate                 32.7   77.5   17.1   76.9    8.8   73.5    4.7   70.5   41.2   65.1
  Avg. idle                      0.5    1.1    0.7    2.9    1.0    8.2    0.8   12.7    9.8   15.4
    Waiting for peers            0.0    0.0    0.1    0.7    0.5    4.0    0.3    4.7    1.3    2.0
    Waiting for master           0.5    1.1    0.5    2.3    0.5    4.2    0.5    7.9    8.5   13.4

Fig. 3. Distribution of sizes of equivalence classes, on log/log scale (from [32]) [x-axis: equivalence class size; y-axis: number of classes].

Fig. 4. Distribution of the number of PLDs per equivalence class, on log/log scale [x-axis: number of PLDs in equivalence class; y-axis: number of classes].

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


100 million quadruple corpus. Note that in the table we use the symbol '~' to indicate a negligible value (0 < ~ < 0.05). We see that, as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here a value of 0.5× would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs) – an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum of two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from

more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors: resp., Dieter Fensel and Rudi Studer – such authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes – which, as we have seen, are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure – we also manually evaluated a random sample of results. For the sampling, we take the closed


Table 6
Largest five equivalence classes (from [32]).

#  Canonical term (lexically first in equivalence class)                              Size    Correct?
1  http://bio2rdf.org/dailymed_drugs:1000                                             8,481   ✗
2  http://bio2rdf.org/dailymed_drugs:1042                                             800     ✗
3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e        443     ✓
4  http://agame2teach.com/#ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1             353     ✓
5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9   316     ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY

SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision, or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY

SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%) and UN-

CLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2% (i.e., (661 + 301)/(661 + 301 + 28)), albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

 #   PLD             PLD                      Co-occur
 1   rdfize.com      uriburner.com            506,623
 2   dbpedia.org     freebase.com             187,054
 3   bio2rdf.org     purl.org                 185,392
 4   loc.gov         info:lc/authorities (a)  166,650
 5   l3s.de          rkbexplorer.com          125,842
 6   bibsonomy.org   l3s.de                   125,832
 7   bibsonomy.org   rkbexplorer.com          125,830
 8   dbpedia.org     mpii.de                  99,827
 9   freebase.com    mpii.de                  89,610
10   dbpedia.org     umbel.org                65,565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities: in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features – including functional properties, inverse-functional properties and certain cardinality restrictions – that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes – such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc. – to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, WIKIPEDIA articles, etc.

Although the Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to the methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive/symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise, on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with fewer than five occurrences in the data – hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty, rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty, and rule cls-maxc2, which supports the semantics of classes with a specified max-cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals – we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply the general rules over the corpus and buffer any input or inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

In Step (i), we extract terminological data – required for the application of our rules – from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning: in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate a T-ground rule:

?x homepage ?y ⇒ ?x isPrimaryTopicOf ?y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be applied by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

?x isPrimaryTopicOf ?y

which may indicate data useful for consolidation (note that here isPrimaryTopicOf is assumed to be inverse-functional).
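As an illustration of how such T-ground rules and consolidation-relevant patterns might be applied in one pass, consider the following hedged Python sketch (the rule, the predicate set and all names are hypothetical simplifications of ours; the actual system operates over sorted, compressed flat files rather than in-memory iterables):

FOAF = "http://xmlns.com/foaf/0.1/"

def rule_homepage(triple):
    # T-ground rule compiled from: homepage rdfs:subPropertyOf isPrimaryTopicOf
    s, p, o = triple
    if p == FOAF + "homepage":
        return [(s, FOAF + "isPrimaryTopicOf", o)]
    return []

T_GROUND_RULES = [rule_homepage]

# Assertional patterns useful for consolidation (e.g., isPrimaryTopicOf,
# assumed inverse-functional); matching statements are set aside for the join.
CONSOLIDATION_PREDICATES = {FOAF + "isPrimaryTopicOf"}

def single_scan(triples, buffer_file):
    """Single scan: yield input and inferred triples, buffering any statement
    whose predicate matches a consolidation-relevant pattern."""
    for triple in triples:
        pending = [triple]
        while pending:
            t = pending.pop()
            if t[1] in CONSOLIDATION_PREDICATES:
                buffer_file.write(" ".join(t) + "\n")
            yield t
            for rule in T_GROUND_RULES:
                pending.extend(rule(t))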

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern – including any owl:sameAs statements found – are buffered to a file for later join computation.

Subsequently, in Step (iv), we must compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external sorts, where batches of statements are sorted in memory, the sorted batches are written to disk, and the sorted batches are merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].
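A rough Python sketch of the latter two primitives is shown below (our own illustration, assuming line-delimited, tab-separated tuples of strings; GZip compression and other practical details of the actual system are omitted):

import heapq, itertools, tempfile

def external_sort(tuples, key, batch_size=1_000_000):
    """External sort: sort fixed-size batches in memory, spill each sorted run
    to a temporary file, then lazily merge the runs (primitive (ii))."""
    runs = []
    it = iter(tuples)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            break
        batch.sort(key=key)
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines("\t".join(t) + "\n" for t in batch)
        run.seek(0)
        runs.append(tuple(line.rstrip("\n").split("\t")) for line in run)
    return heapq.merge(*runs, key=key)

def merge_join(sorted_a, sorted_b, key):
    """Merge-join two streams already sorted on `key`, yielding the pair of
    groups that share each join value (primitive (iii))."""
    ga = itertools.groupby(sorted_a, key=key)
    gb = itertools.groupby(sorted_b, key=key)
    ka, a = next(ga, (None, None))
    kb, b = next(gb, (None, None))
    while a is not None and b is not None:
        if ka == kb:
            yield ka, list(a), list(b)
            ka, a = next(ga, (None, None))
            kb, b = next(gb, (None, None))
        elif ka < kb:
            ka, a = next(ga, (None, None))
        else:
            kb, b = next(gb, (None, None))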

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data

18 These files are GZip-compressed flat files of N-Triple-like syntax, encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.


according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.
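For illustration, a simplified Python sketch of the inverse-functional-property join is given below (our own illustration, assuming the relevant memberships are already sorted by predicate–object; linking each subject to the lexically lowest subject in its group, rather than emitting all pairs, is a simplification of ours):

from itertools import groupby

OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def infer_sameas_from_ifp(sorted_ifp_triples):
    """Given (s, p, o) memberships of inverse-functional properties, pre-sorted
    by the (p, o) join key, emit owl:sameAs links from each subject to the
    lexically lowest subject sharing that key (in the spirit of rule prp-ifp)."""
    for (_p, _o), group in groupby(sorted_ifp_triples, key=lambda t: (t[1], t[2])):
        subjects = sorted({s for s, _, _ in group})
        lowest = subjects[0]
        for s in subjects[1:]:
            yield (s, OWL_SAMEAS, lowest)

# Functional-property reasoning (prp-fp) is analogous, with the data sorted by
# subject and the roles of subject and object swapped.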

Both the originally asserted and the newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method, again leveraging external sorts and merge-joins to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc., to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:

20 One could instead build an on-disk map for equivalence classes and pivot elements, and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;
– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;
(iv) the process of reaching a fixpoint is accelerated using available main memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20
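The rewrite itself can be pictured with the following Python sketch (names are ours, not the authors'): it merge-joins a stream of tuples, sorted on the position to be rewritten, against the pruned equality index, sorted on the non-canonical term.

def rewrite_position(sorted_data, sorted_sameas, pos):
    """Rewrite position `pos` (0 = subject, 2 = object) of `sorted_data` using
    `sorted_sameas`, a stream of (non_canonical, canonical) pairs; both inputs
    must be sorted on the rewritten position / the non-canonical term."""
    sa_iter = iter(sorted_sameas)
    non_canon, canon = next(sa_iter, (None, None))
    for tup in sorted_data:
        term = tup[pos]
        # advance the equality index until it catches up with the data
        while non_canon is not None and non_canon < term:
            non_canon, canon = next(sa_iter, (None, None))
        if non_canon == term:
            tup = tup[:pos] + (canon,) + tup[pos + 1:]
        yield tup

To rewrite both subject and object, one would run this once on the subject position, externally sort the output by object, and run it again on the object position, as described above.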

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
<http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may, in turn, "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs closure, we require a higher-level iterative process, as follows:

(i) initially apply the consolidation rules and append the results to a file alongside the asserted owl:sameAs statements found;
(ii) apply the initial closure of the owl:sameAs data;
(iii) then, iteratively until no new owl:sameAs inferences are found:

– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
– derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;
– re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans, over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;
(ii) Gather: gather the terminological data onto the master machine and thereafter bind the terminological patterns of the general/consolidation rules;
(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;
(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;
(v) Gather: gather all consolidation-relevant statements onto the master machine; then, in parallel:


Table 8
Breakdown of timing of distributed extended consolidation with reasoning (min / % of total execution time), where * identifies the master/slave tasks that run concurrently.

Category                                      1 slave          2 slaves         4 slaves         8 slaves        Full-8
Master
  Total execution time                        454.3 / 100.0    230.2 / 100.0    127.5 / 100.0    81.3 / 100.0    740.4 / 100.0
  Executing                                   29.9 / 6.6       30.3 / 13.1      32.5 / 25.5      30.1 / 37.0     243.6 / 32.9
    Aggregate Consolidation Relevant Data     4.3 / 0.9        4.5 / 2.0        4.7 / 3.7        4.6 / 5.7       29.9 / 4.0
    Closing owl:sameAs *                      24.4 / 5.4       24.5 / 10.6      25.4 / 19.9      24.4 / 30.0     211.2 / 28.5
    Miscellaneous                             1.2 / 0.3        1.3 / 0.5        2.4 / 1.9        1.1 / 1.3       2.5 / 0.3
  Idle (waiting for slaves)                   424.5 / 93.4     199.9 / 86.9     95.0 / 74.5      51.2 / 63.0     496.8 / 67.1
Slaves
  Avg. Executing (total)                      448.9 / 98.8     221.2 / 96.1     111.6 / 87.5     56.5 / 69.4     599.1 / 80.9
    Extract Terminology                       44.4 / 9.8       22.7 / 9.9       11.8 / 9.2       6.9 / 8.4       49.4 / 6.7
    Extract Consolidation Relevant Data       95.5 / 21.0      48.3 / 21.0      24.3 / 19.1      13.2 / 16.2     137.9 / 18.6
    Initial Sort (by subject) *               98.0 / 21.6      48.3 / 21.0      23.7 / 18.6      11.6 / 14.2     133.8 / 18.1
    Consolidation                             211.0 / 46.4     101.9 / 44.3     51.9 / 40.7      24.9 / 30.6     278.0 / 37.5
  Avg. Idle                                   5.5 / 1.2        8.9 / 3.9        37.9 / 29.7      23.6 / 29.0     141.3 / 19.1
    Waiting for peers                         0.0 / 0.0        3.2 / 1.4        7.1 / 5.6        6.3 / 7.7       37.5 / 5.1
    Waiting for master                        5.5 / 1.2        5.8 / 2.5        30.8 / 24.1      17.3 / 21.2     103.8 / 14.0


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;
– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

(vi) Flood: send the closed owl:sameAs data to the slave machines, once the distributed sort has been completed;

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest, due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible – the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time – of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup – in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards to performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507×, 0.554× and 0.638×, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements – this time directly extracted from explicit statements, as opposed to explicit and inferred statements – which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were, in fact, negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9
Top ten most frequently occurring blacklisted values.

 #   Blacklisted term                              Occurrences
 1   Empty literals                                584,735
 2   <http://null>                                 414,088
 3   <http://www.vox.com/gone>                     150,402
 4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"    58,338
 5   <http://www.facebook.com>                     6,988
 6   <http://facebook.com>                         5,462
 7   <http://www.google.com>                       2,234
 8   <http://www.facebook.com>                     1,254
 9   <http://google.com>                           1,108
10   <http://null.com>                             542

[Figure omitted: log/log plot; x-axis: equivalence class size; y-axis: number of classes; series: baseline consolidation, reasoning consolidation.]
Fig. 5. Distribution of the number of identifiers per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log).


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of one. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus, we can say that – if applying only the consolidation rules – also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties, and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted – however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), of which only 129 were URIs, and where other variations in the statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters, and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9 – this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23

During the computation of the owl:sameAs closure, we found zero inferences through the cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that for each s1 SA s2, s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline – 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 2,173× from baseline – 54.6% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline – 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user – labelled Team Vox – commonly appearing in user FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford – labelled IBRG – where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME 823 times (82.3%), TRIVIALLY SAME 145 times (14.5%), DIFFERENT 23 times (2.3%) and UNCLEAR 9 times (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are, in fact, slightly more precise (on average) than the baseline equivalences; this is due, in part,

Table 10
Largest five equivalence classes after extended consolidation.

#  Canonical term (lexically lowest in equivalence class)                     Size    Correct?
1  bnode37 from http://a12iggymom.vox.com/profile/foaf.rdf                    33,052  ✓
2  http://bio2rdf.org/dailymed_drugs:1000                                     8,481   ✗
3  http://ajft.org/rdf/foaf.rdf#_me                                           8,140   ✓
4  bnode4 from http://174.129.12.140:8080/tcm/data/association/100            4,395   ✓
5  bnode1 from http://aaa977.vox.com/profile/foaf.rdf                         1,977   ✓

[Figure omitted: log/log plot; x-axis: number of PLDs in equivalence class; y-axis: number of classes; series: baseline consolidation, reasoning consolidation.]
Fig. 6. Distribution of the number of PLDs per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log).

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

#  Canonical identifier                              BL   R
1  <http://dblp.l3s.de/Tim_Berners-Lee>              26   50
2  <genid:danbri>                                    10   138
3  <http://update.status.net/>                       0    0
4  <http://www.ldodds.com/foaf/foaf-a-matic>         0    0
5  <http://update.status.net/user/1#acct>            0    6

[Figure omitted: log/log plot; x-axis: number of PLDs mentioning blank-node/URI term; y-axis: number of terms; series: raw data, baseline consolidated, reasoning consolidated.]
Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated, and reasoning consolidated data (log/log).


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data, which "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57.1% of equivalence classes were inter-PLD (containing identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers

25 We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley" – one of the founders and leading contributors of the FOAF project – with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs – one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf



(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approaches is not so pronounced. The most widely referenced consolidated entity – in terms of unique PLDs – was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities – those with higher concurrence scores – are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;
– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities – if a suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject pairs (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);
(ii) the concurrence measure should give a higher weight to exclusive shared pairs – pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;
(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities – a shared pair cannot reduce the measured concurrence of the sharing entities;
(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;
(v) correlation may exist between shared pairs – e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers) – where we wish to dampen the cumulative effect of correlation in the concurrence analysis;
(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .

ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com> .

ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com> .

ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality, as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p, s), is the cardinality of the set {o ∈ C | (s, p, o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p, o), is the cardinality of the set {s ∈ U ∪ B | (s, p, o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting Gex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph – e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores – note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
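These observed statistics are straightforward to compute; the following Python sketch (our own illustration over the running example, not part of the described system) reproduces the values just discussed:

# Minimal sketch of Definitions 1 and 2 over the running-example graph.
G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com>"),
]

def card(graph, p, s):
    """Observed cardinality: number of distinct objects appearing with (s, p)."""
    return len({o for s_, p_, o in graph if (s_, p_) == (s, p)})

def icard(graph, p, o):
    """Observed inverse-cardinality: number of distinct subjects appearing with (p, o)."""
    return len({s for s, p_, o_ in graph if (p_, o_) == (p, o)})

assert card(G_ex, "foaf:maker", "dblp:AliceB10") == 2
assert icard(G_ex, "foaf:gender", '"female"') == 2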

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality) – we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathit{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater than or equal to one: the numerator is, by definition, greater than or equal to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously, properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties – in their respective directions – as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values of properties with a low number of observations towards a global mean – consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula as follows, to weight properties with few observations towards a global mean:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

$$\mathit{AAIC}_G(p) = \frac{\mathit{AIC}_G(p)\cdot|G_p| + \overline{\mathit{AIC}_G}\cdot\overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1)$$

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathit{AIC}_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{\mathit{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s\, \exists o : (s,p,o) \in G\}|}$).
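Read operationally, Eq. (1) can be sketched as follows in Python (our own illustration; graph is a list of (s, p, o) triples as in the earlier sketch): for predicates with many observed objects, the result stays close to the raw AIC_G(p) reading, while sparsely observed predicates are pulled towards the global mean.

def aic(graph, p):
    """Average inverse-cardinality of p: #triples with predicate p divided by
    the number of distinct objects appearing with p (Definition 3)."""
    p_triples = [(s, o) for s, p_, o in graph if p_ == p]
    return len(p_triples) / len({o for _, o in p_triples})

def aaic(graph, p):
    """Adjusted average inverse-cardinality of p (Eq. (1)): the AIC reading is
    dampened towards the global mean AIC by the relative sample size."""
    sample_size = len({o for s, p_, o in graph if p_ == p})   # |G_p|
    po_pairs = {(p_, o) for _, p_, o in graph}
    predicates = {p_ for _, p_, _ in graph}
    mean_aic = len(graph) / len(po_pairs)                     # mean inverse-cardinality
    mean_sample = len(po_pairs) / len(predicates)             # mean #objects per predicate
    return (aic(graph, p) * sample_size + mean_aic * mean_sample) / \
           (sample_size + mean_sample)

# e.g., with the earlier example graph: aic(G_ex, "foaf:gender") == 1.5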

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that $\mathit{AIC}_G(p)\cdot|G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathit{AIC}_G}\cdot\overline{|G|} = \frac{|G|}{|\{p \mid \exists s\, \exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate. We maintain Eq.

[Figure omitted: example RDF graph relating sw:person/stefan-decker and sw:person/andreas-harth to shared papers (sw:paper/140, sw:paper/2221, sw:paper/3403, via foaf:made, dc:creator, swrc:author, foaf:maker), a shared affiliation (sw:org/deri-nui-galway, sw:org/nui-galway, via swrc:affiliation, foaf:member) and a shared location (dbpedia:Ireland, via foaf:based_near).]
Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).


(1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean ($\overline{\mathit{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size ($\overline{|G|}$).

Now we move towards combining these metrics to determine the concurrence of entities that share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company – e.g., http://google.com – should not gain as much concurrence as entities that share a value referential to a smaller company – e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p, s) with respect to a graph G is given as:

$$C_G(p,s) = \frac{1}{\mathit{Card}_G(p,s)\cdot\mathit{AAC}_G(p)}$$

and the concurrence-coefficient of a predicate–object pair (p, o) with respect to a graph G is analogously given as:

$$\mathit{IC}_G(p,o) = \frac{1}{\mathit{ICard}_G(p,o)\cdot\mathit{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0, 1], since the denominator is, by definition, necessarily greater than or equal to one.

To take an example let pwh = foafworkplaceHomepage andsay that we compute AAICG(pwh)=7 from a large number of obser-

28 Here we try to distinguish between propertyndashvalue pairs that are exclusive inreality (ie on the level of whatrsquos signified) and those that are exclusive in the givengraph Admittedly one could think of counter-examples where not including thegeneral statistics of the property may yield a better indication of weightedconcurrence particularly for generic properties that can be applied in many contextsfor example consider the exclusive predicatendashobject pair (skos subject cate-gory Koenigsegg_vehicles) given for a non-exclusive property

vations indicating that each workplace homepage in the graph G islinked to by on average seven employees Further letog = httpgooglecom and assume that og occurs 2000 timesas a value for pwh ICardG(pwhog) = 2000 now ICGethpwh ogTHORN frac14

120007 frac14 000007 Also let od = httpderiie such thatICardG(pwhod)=100 now ICGethpwh odTHORN frac14 1

107 000143 Heresharing DERI as a workplace will indicate a higher level of concur-rence than analogously sharing Google

Finally we require some means of aggregating the coefficientsof the set of pairs that two entities share to derive the final concur-rence measure

Definition 6 (Aggregated concurrence score) Let Z = (z1 zn) be atuple such that for each i = 1 n zi2]01] The aggregatedconcurrence value ACSn is computed iteratively starting withACS0 = 0 then for each k = 1 n ACSk = zk + ACSk1 zkfraslACSk1

The computation of the ACS value is the same process as deter-mining the probability of two independent events occurringmdashP(A _ B) = P(A) + P(B) P(AfraslB)mdashwhich is by definition commuta-tive and associative and thus computation is independent of theorder of the elements in the tuple It may be more accurate to viewthe coefficients as fuzzy values and the aggregation function as adisjunctive combination in some extensions of fuzzy logic [75]

However the underlying coefficients may not be derived fromstrictly independent phenomena there may indeed be correlationbetween the propertyndashvalue pairs that two entities share To illus-trate we reintroduce a relevant example from [34] shown in Fig 8where we see two researchers that have co-authored many paperstogether have the same affiliation and are based in the samecountry

This example illustrates three categories of concurrencecorrelation

(i) same-value correlation where two entities may be linked to thesame value by multiple predicates in either direction (egfoaf made dccreator swrcauthor foafmaker)

(ii) intra-property correlation where two entities that share agiven propertyndashvalue pair are likely to share further valuesfor the same property (eg co-authors sharing one valuefor foafmade are more likely to share further values)

(iii) inter-property correlation where two entities sharing a givenpropertyndashvalue pair are likely to share further distinct butrelated propertyndashvalue pairs (eg having the same valuefor swrcaffiliation and foafbased_near)

Ideally we would like to reflect such correlation in the compu-tation of the concurrence between the two entities

Regarding same-value correlation for a value with multipleedges shared between two entities we choose the shared predicate

96 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

edge with the lowest AA[I]C value and disregard the other edgesie we only consider the most exclusive property used by bothentities to link to the given value and prune the other edges

Regarding intra-property correlation we apply a lower-levelaggregation for each predicate in the set of shared predicate-valuepairs Instead of aggregating a single tuple of coefficients we gen-erate a bag of tuples Z frac14 fZp1 Zpn

g where each element Zpi

represents the tuple of (non-pruned) coefficients generated forthe predicate pi29 We then aggregate this bag as follows

ACSethZTHORN frac14 ACSethACSethZpi AAfrac12ICethpiTHORNTHORNZpi

2ZTHORN

where AA[I]C is either AAC or AAIC dependant on the directionalityof the predicate-value pair observed Thus the total contributionpossible through a given predicate (eg foafmade) has anupper-bound set as its AA[I]C value where each successive sharedvalue for that predicate (eg each successive co-authored paper)contributes positively (but increasingly less) to the overall concur-rence measure

Detecting and counteracting inter-property correlation is per-haps more difficult and we leave this as an open question

612 Implementing entity-concurrence analysisWe aim to implement the above methods using sorts and scans

and we wish to avoid any form of complex indexing or pair-wisecomparison First we wish to extract the statistics relating to the(inverse-)cardinalities of the predicates in the data Given that thereare 23 thousand unique predicates found in the input corpus weassume that we can fit the list of predicates and their associatedstatistics in memorymdashif such were not the case one could consideran on-disk map where we would expect a high cache hit-rate basedon the distribution of property occurrences in the data (cf [32])

Moving forward we can calculate the necessary predicate-levelstatistics by first sorting the data according to natural order(spoc) and then scanning the data computing the cardinality(number of distinct objects) for each (sp) pair and maintainingthe average cardinality for each p found For inverse-cardinalityscores we apply the same process sorting instead by (opsc) or-der counting the number of distinct subjects for each (po) pairand maintaining the average inverse-cardinality scores for eachp After each scan the statistics of the properties are adjustedaccording to the credibility formula in Eq (4)

We then apply a second scan of the sorted corpus first we scanthe data sorted in natural order and for each (sp) pair for each setof unique objects Ops found thereon and for each pair in

fethoi ojTHORN 2 U [ B U [ B j oi oj 2 Ops oi lt ojg

where lt denotes lexicographical order we output the following sex-tuple to an on-disk file

ethoi ojCethp sTHORNp sTHORN

where Cethp sTHORN frac14 1jOps jAACethpTHORN We apply the same process for the other

direction for each (po) pair for each set of unique subjects Spo andfor each pair in

fethsi sjTHORN 2 U [ B U [ B j si sj 2 Spo si lt sjg

we output analogous sextuples of the form

ethsi sj ICethp oTHORN p othornTHORN

We call the sets Ops and their analogues Spo concurrence classesdenoting sets of entities that share the given predicatendashsubjectpredicatendashobject pair respectively Here note that the lsquo + rsquo andlsquo rsquo elements simply mark and track the directionality from whichthe tuple was generated required for the final aggregation of the

29 For brevity we omit the graph subscript

co-efficient scores Similarly we do not immediately materialisethe symmetric concurrence scores where we instead do so at theend so as to forego duplication of intermediary processing

Once generated we can sort the two files of tuples by their nat-ural order and perform a merge-join on the first two elementsmdashgeneralising the directional oisi to simply ei each (eiej) pair de-notes two entities that share some predicate-value pairs in com-mon where we can scan the sorted data and aggregate the finalconcurrence measure for each (eiej) pair using the information en-coded in the respective tuples We can thus generate (triviallysorted) tuples of the form (eiejs) where s denotes the final aggre-gated concurrence score computed for the two entities optionallywe can also write the symmetric concurrence tuples (ejeis) whichcan be sorted separately as required

Note that the number of tuples generated is quadratic with re-spect to the size of the respective concurrence class which be-comes a major impediment for scalability given the presence oflarge such setsmdashfor example consider a corpus containing 1 mil-lion persons sharing the value female for the property foaf

gender where we would have to generate 1062106

2 500 billionnon-reflexive non-symmetric concurrence tuples However wecan leverage the fact that such sets can only invoke a minor influ-ence on the final concurrence of their elements given that themagnitude of the setmdasheg jSpojmdashis a factor in the denominator ofthe computed C(po) score such that Cethp oTHORN 1

jSop j Thus in practice

we implement a maximum-size threshold for the Spo and Ops con-currence classes this threshold is selected based on a practicalupper limit for raw similarity tuples to be generated where theappropriate maximum class size can trivially be determined along-side the derivation of the predicate statistics For the purpose ofevaluation we choose to keep the number of raw tuples generatedat around 1 billion and so set the maximum concurrence classsize at 38mdashwe will see more in Section 64

62 Distributed implementation

Given the previous discussion our distributed implementationis fairly straight-forward as follows

(i) Coordinate the slave machines split their segment of thecorpus according to a modulo-hash function on the subjectposition of the data sort the segments and send the splitsegments to the peer determined by the hash-function theslaves simultaneously gather incoming sorted segmentsand subsequently perform a merge-sort of the segments

(ii) Coordinate the slave machines apply the same operationthis time hashing on objectmdashtriples with rdf type as pred-icate are not included in the object-hashing subsequentlythe slaves merge-sort the segments ordered by object

(iii) Run the slave machines then extract predicate-level statis-tics and statistics relating to the concurrence-class-size dis-tribution which are used to decide upon the class sizethreshold

(iv) Gatherfloodrun the master machine gathers and aggre-gates the high-level statistics generated by the slavemachines in the previous step and sends a copy of the globalstatistics back to each machine the slaves subsequentlygenerate the raw concurrence-encoding sextuples asdescribed before from a scan of the data in both orders

(v) Coordinate the slave machines coordinate the locally gen-erated sextuples according to the first element (join posi-tion) as before

(vi) Run the slave machines aggregate the sextuples coordi-nated in the previous step and produce the final non-sym-metric concurrence tuples

Table 12Breakdown of timing of distributed concurrence analysis

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 5162 1000 2627 1000 1378 1000 730 1000 8354 1000Executing 22 04 23 09 31 23 34 46 21 03

Miscellaneous 22 04 23 09 31 23 34 46 21 03Idle (waiting for slaves) 5140 996 2604 991 1346 977 696 954 8333 997

SlaveAvg executing (total exc idle) 5140 996 2578 981 1311 951 664 910 7866 942

Splitsortscatter (subject) 1196 232 593 226 299 217 149 204 1759 211Merge-sort (subject) 421 82 207 79 112 81 57 79 624 75

Splitsortscatter (object) 1154 224 565 215 288 209 143 196 1640 196Merge-sort (object) 372 72 187 71 96 70 51 70 556 66Extract high-level statistics 208 40 101 38 51 37 27 37 284 33Generate raw concurrence tuples 454 88 233 89 116 84 59 80 677 81Cooordinatesort concurrence tuples 976 189 501 191 250 181 125 172 1660 199Merge-sortaggregate similarity 360 70 192 73 100 72 53 72 666 80

Avg idle 22 04 49 19 67 49 65 90 488 58Waiting for peers 00 00 26 10 36 26 32 43 467 56Waiting for master 22 04 23 09 31 23 34 46 21 03

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 97

(vii) Run the slave machines produce the symmetric version ofthe concurrence tuples and coordinate and sort on the firstelement

Here we make heavy use of the coordinate function to aligndata according to the join position required for the subsequentprocessing stepmdashin particular aligning the raw data by subjectand object and then the concurrence tuples analogously

Note that we do not hash on the object position of rdf typetriples our raw corpus contains 2068 million such triples and gi-ven the distribution of class memberships we assume that hashingthese values will lead to uneven distribution of data and subse-quently uneven load balancingmdasheg 792 of all class member-ships are for foafPerson hashing on which would send 1637million triples to one machine which alone is greater than theaverage number of triples we would expect per machine (1398million) In any case given that our corpus contains 105 thousandunique values for rdf type we would expect the average-in-verse-cardinality to be approximately 1970mdasheven for classes withtwo members the potential effect on concurrence is negligible

63 Performance evaluation

We apply our concurrence analysis over the consolidated cor-pus derived in Section 5 The total time taken was 139 h Sortingsplitting and scattering the data according to subject on the slavemachines took 306 h with an average idle time of 77 min(42) Subsequently merge-sorting the sorted segments took113 h with an average idle time of 54 min (8) Analogously sort-ing splitting and scattering the non-rdftype statements by ob-ject took 293 h with an average idle time of 118 min (67)Merge sorting the data by object took 099 h with an average idletime of 38 min (63) Extracting the predicate statistics andthreshold information from data sorted in both orders took29 min with an average idle time of 06 min (21) Generatingthe raw unsorted similarity tuples took 698 min with an averageidle time of 21 min (3) Sorting and coordinating the raw similar-ity tuples across the machines took 1801 min with an average idletime of 141 min (78) Aggregating the final similarity took678 min with an average idle time of 12 min (18)

Table 12 presents a breakdown of the timing of the task Firstwith regards application over the full corpus although this task re-quires some aggregation of global-knowledge by the master

machine the volume of data involved is minimal a total of 21minutes is spent on the master machine performing various minortasks (initialisation remote calls logging aggregation and broad-cast of statistics) Thus 997 of the task is performed in parallelon the slave machine Although there is less time spent waitingfor the master machine compared to the previous two tasks deriv-ing the concurrence measures involves three expensive sortcoor-dinatemerge-sort operations to redistribute and sort the data overthe slave swarm The slave machines were idle for on average 58of the total task time most of this idle time (996) was spentwaiting for peers The master machine was idle for almost the en-tire task with 997 waiting for the slave machines to finish theirtasksmdashagain interleaving a job for another task would have practi-cal benefits

Second with regards the timing of tasks when varying the num-ber of slave machines for 100 million quadruples we see that thepercentage of time spent idle on the slave machines increases from04 (1) to 9 (8) However for each incremental doubling of thenumber of slave machines the total overall task times are reducedby 0509 0525 and 0517 respectively

64 Results evaluation

With respect to data distribution after hashing on subject weobserved an average absolute deviation (average distance fromthe mean) of 176 thousand triples across the slave machines rep-resenting an average 013 deviation from the mean near-optimaldata distribution After hashing on the object of non-rdftype tri-ples we observed an average absolute deviation of 129 million tri-ples across the machines representing an average 11 deviationfrom the mean in particular we note that one machine was as-signed 37 million triples above the mean (an additional 33 abovethe mean) Although not optimal the percentage of data deviationgiven by hashing on object is still within the natural variation inrun-times we have seen for the slave machines during most paral-lel tasks

In Fig 9a and b we illustrate the effect of including increasinglylarge concurrence classes on the number of raw concurrence tuplesgenerated For the predicatendashobject pairs we observe a powerndashlaw(-esque) relationship between the size of the concurrence classand the number of such classes observed Second we observe thatthe number of concurrences generated for each increasing classsize initially remains fairly staticmdashie larger class sizes give

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(a) Predicate-object pairs

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(b) Predicate-subject pairs

Fig 9 Breakdown of potential concurrence classes given with respect to predicatendashobject pairs and predicatendashsubject pairs respectively where for each class size on the x-axis (si) we show the number of classes (ci) the raw non-reflexive non-symmetric concurrence tuples generated (spi frac14 ci

s2ithornsi

2 ) a cumulative count of tuples generated forincreasing class sizes (cspi frac14

Pj6ispj) and our cut-off (si = 38) that we have chosen to keep the total number of tuples at 1 billion (loglog)

Table 13Top five predicates with respect to lowest adjusted average cardinality (AAC)

Predicate Objects Triples AAC

1 foafnick 150433035 150437864 10002 lldpubmedjournal 6790426 6795285 10033 rdfvalue 2802998 2803021 10054 eurostatgeo 2642034 2642034 10055 eurostattime 2641532 2641532 1005

Table 14Top five predicates with respect to lowest adjusted average inverse-cardinality(AAIC)

Predicate Subjects Triples AAIC

1 lldpubmedmeshHeading 2121384 2121384 10092 opiumfieldrecommendation 1920992 1920992 10103 fbtypeobjectkey 1108187 1108187 10174 foafpage 1702704 1712970 10175 skipinionshasFeature 1010604 1010604 1019

98 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

quadratically more concurrences but occur polynomially less of-tenmdashuntil the point where the largest classes that generally onlyhave one occurrence is reached and the number of concurrencesbegins to increase quadratically Also shown is the cumulativecount of concurrence tuples generated for increasing class sizeswhere we initially see a powerndashlaw(-esque) correlation whichsubsequently begins to flatten as the larger concurrence classes be-come more sparse (although more massive)

For the predicatendashsubject pairs the same roughly holds truealthough we see fewer of the very largest concurrence classesthe largest concurrence class given by a predicatendashsubject pairwas 79 thousand versus 19 million for the largest predicatendashob-ject pair respectively given by the pairs (kwamap macsman-ual_rameau_lcsh) and (opiumfieldrating ) Also weobserve some lsquolsquonoisersquorsquo where for milestone concurrence class sizes(esp at 50 100 1000 2000) we observe an unusual amount ofclasses For example there were 72 thousand concurrence classesof precisely size 1000 (versus 88 concurrence classes at size996)mdashthe 1000 limit was due to a FOAF exporter from the hi5comdomain which seemingly enforces that limit on the total lsquolsquofriendscountrsquorsquo of users translating into many users with precisely 1000values for foafknows30 Also for example there were 55 thousandclasses of size 2000 (versus 6 classes of size 1999)mdashalmost all ofthese were due to an exporter from the bio2rdforg domain whichputs this limit on values for the b2rlinkedToFrom property31 Wealso encountered unusually large numbers of classes approximatingthese milestones such as 73 at 2001 Such phenomena explain thestaggered lsquolsquospikesrsquorsquoand lsquolsquodiscontinuitiesrsquorsquo in Fig 9b which can be ob-served to correlate with such milestone values (in fact similar butless noticeable spikes are also present in Fig 9a)

These figures allow us to choose a threshold of concurrence-class size given an upper bound on raw concurrence tuples togenerate For the purposes of evaluation we choose to keep thenumber of materialised concurrence tuples at around 1 billionwhich limits our maximum concurrence class size to 38 (fromwhich we produce 1024 billion tuples 721 million through shared(po) pairs and 303 million through (ps) pairs)

With respect to the statistics of predicates for the predicatendashsubject pairs each predicate had an average of 25229 unique ob-jects for 37953 total triples giving an average cardinality of15 We give the five predicates observed to have the lowest ad-

30 cf httpapihi5comrestprofilefoaf10061469731 cf httpbio2rdforgmeshD000123Q000235

justed average cardinality in Table 13 These predicates are judgedby the algorithm to be the most selective for identifying their sub-ject with a given object value (ie quasi-inverse-functional) notethat the bottom two predicates will not generate any concurrencescores since they are perfectly unique to a given object (ie thenumber of triples equals the number of objects such that no twoentities can share a value for this predicate) For the predicatendashob-ject pairs there was an average of 11572 subjects for 20532 tri-ples giving an average inverse-cardinality of 264 Weanalogously give the five predicates observed to have the lowestadjusted average inverse cardinality in Table 14 These predicatesare judged to be the most selective for identifying their object witha given subject value (ie quasi-functional) however four of thesepredicates will not generate any concurrences since they are per-fectly unique to a given subject (ie those where the number of tri-ples equals the number of subjects)

Aggregation produced a final total of 6369 million weightedconcurrence pairs with a mean concurrence weight of 00159Of these pairs 195 million involved a pair of identifiers from dif-ferent PLDs (31) whereas 6174 million involved identifiers fromthe same PLD however the average concurrence value for an in-tra-PLD pair was 0446 versus 0002 for inter-PLD pairsmdashalthough

Table 15Top five concurrent entities and the number of pairs they share

Entity label 1 Entity label 2 Concur

1 New York City New York State 7912 London England 8943 Tokyo Japan 9004 Toronto Ontario 4185 Philadelphia Pennsylvania 217

Table 16Breakdown of concurrences for top five ranked entities in SWSE ordered by rankwith respectively entity label number of concurrent entities found the label of theconcurrent entity with the largest degree and finally the degree value

Ranked entity Con lsquolsquoClosestrsquorsquo entity Val

1 Tim Berners-Lee 908 Lalana Kagal 0832 Dan Brickley 2552 Libby Miller 0943 updatestatusnet 11 socialnetworkro 0454 FOAF-a-matic 21 foafme 0235 Evan Prodromou 3367 Stav Prodromou 089

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 99

fewer intra-PLD concurrences are found they typically have higherconcurrences32

In Table 15 we give the labels of top five most concurrent enti-ties including the number of pairs they sharemdashthe concurrencescore for each of these pairs was gt09999999 We note that theyare all locations where particularly on WIKIPEDIA (and thus filteringthrough to DBpedia) properties with location values are typicallyduplicated (eg dbpdeathPlace dbpbirthPlace dbphead-quartersmdashproperties that are quasi-functional) for exampleNew York City and New York State are both the dbpdeath-

Place of dbpediaIsacc_Asimov etcIn Table 16 we give a description of the concurrent entities

found for the top-five ranked entitiesmdashfor brevity again we showentity labels In particular we note that a large amount of concur-rent entities are identified for the highly-ranked persons With re-spect to the strongest concurrences (i) Tim and his former studentLalana share twelve primarily academic links coauthoring six pa-pers (ii) Dan and Libby co-founders of the FOAF project share87 links primarily 73 foafknows relations to and from the samepeople as well as a co-authored paper occupying the same profes-sional positions etc33 (iii) updatestatusnet and socialnet-

workro share a single foafaccountServiceHomepage linkfrom a common user (iv) similarly the FOAF-a-matic and foafme

services share a single mvcbgeneratorAgent inlink (v) finallyEvan and Stav share 69 foafknows inlinks and outlinks exportedfrom the identica service

34 We instead refer the interested reader to [9] for some previous works on the topic

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data and thus this is anapproximative reading for the purposes of illustration we extract the PLDs fromcanonical identifiers which are choosen based on arbitrary lexical ordering

33 Notably Leigh Dodds (creator of the FOAF-a-matic service) is linked by theproperty quaffing drankBeerWith to both

of general inconsistency repair for Linked Data35 We use syntactic shortcuts in our file to denote when s = s0 andor o = o0

Maintaining the additional rewrite information during the consolidation process istrivial where the output of consolidating subjects gives quintuples (spocs0) whichare then sorted and consolidated by o to produce the given sextuples

36 httpmodelsokkamorgENS-core-vocabularycountry_of_residence37 httpontologydesignpatternsorgcpowlfsdas

RDF data representing formal contradictions according to OWLsemantics Where such contradictions are created due to consoli-dation we believe this to be a good indicator of erroneousconsolidation

Once erroneous equivalences have been detected we wouldlike to subsequently diagnose and repair the coreferences involvedOne option would be to completely disband the entire equivalenceclass however there may often be only one problematic identifierthat eg causes inconsistency in a large equivalence classmdashbreak-ing up all equivalences would be a coarse solution Instead hereinwe propose a more fine-grained method for repairing equivalenceclasses that incrementally rebuilds coreferences in the set in amanner that preserves consistency and that is based on the originalevidences for the equivalences found as well as concurrence scores(discussed in the previous section) to indicate how similar the ori-ginal entities are Once the erroneous equivalence classes havebeen repaired we can then revise the consolidated corpus to reflectthe changes

71 High-level approach

The high-level approach is to see if the consolidation of anyentities conducted in Section 5 lead to any novel inconsistenciesand subsequently recant the equivalences involved thus it isimportant to note that our aim is not to repair inconsistenciesin the data but instead to repair incorrect consolidation sympto-mised by inconsistency34 In this section we (i) describe whatforms of inconsistency we detect and how we detect them (ii)characterise how inconsistencies can be caused by consolidationusing examples from our corpus where possible (iii) discuss therepair of equivalence classes that have been determined to causeinconsistency

First in order to track which data are consolidated and whichnot in the previous consolidation step we output sextuples ofthe form

ethsp o c s0 o0THORN

where s p o c denote the consolidated quadruple containingcanonical identifiers in the subjectobject position as appropri-ate and s0 and o0 denote the input identifiers prior toconsolidation35

To detect inconsistencies in the consolidated corpus we use theOWL 2 RLRDF rules with the false consequent [24] as listed inTable 4 A quick check of our corpus revealed that one documentprovides eight owlAsymmetricProperty and 10 owlIrre-

flexiveProperty axioms36 and one directory gives 9 owlAll-

DisjointClasses axioms37 and where we found no other OWL2 axioms relevant to the rules in Table 4

We also consider an additional rule whose semantics are indi-rectly axiomatised by the OWL 2 RLRDF rules (through prp-fp dt-diff and eq-diff1) but which we must support directly since we donot consider consolidation of literals

p a owlFunctionalProperty

x p l1 l2

l1 owldifferentFrom l2

)false

100 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

where we underline the terminological pattern To illustrate wetake an example from our corpus

Terminological [httpdbpediaorgdata3

lengthrdf]

dpolength rdftype owlFunctionalProperty

Assertional [httpdbpediaorgdata

Fiat_Nuova_500xml]dbpediaFiat_Nuova_5000 dpolength3546

^^xsddouble

Assertional [httpdbpediaorgdata

Fiat_500xml]dbpediaFiat_Nuova0 dpolength 297

^^xsddouble

where we use the prime symbol [0] to denote identifiers consideredcoreferent by consolidation Here we see two very closely relatedmodels of cars consolidated in the previous step but we now iden-tify that they have two different values for dpolengthmdasha func-tional-propertymdashand thus the consolidation raises an inconsistency

Note that we do not expect the owldifferentFrom assertionto be materialised but instead intend a rather more relaxed seman-tics based on a heurisitic comparison given two (distinct) literalbindings for l1 and l2 we flag an inconsistency iff (i) the datavalues of the two bindings are not equal (standard OWL semantics)and (ii) their lower-case string value (minus language-tags and data-types) are lexically unequal In particular the relaxation is inspiredby the definition of the FOAF (datatype) functional propertiesfoafage foafgender and foafbirthday where the rangeof these properties is rather loosely defined a generic range ofrdfsLiteral is formally defined for these properties with infor-mal recommendations to use malefemale as gender values andMM-DD syntax for birthdays but not giving recommendations fordatatype or language-tags The latter relaxation means that wewould not flag an inconsistency in the following data

3

co

Terminological [httpxmlnscomfoafspecindexrdf]

9 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e f rn=aacute=ndezSergiohtml0 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e a nzuolaSergio_Fern=aacute=ndezhtml1

foafPerson owldisjointWith foafDocument

Assertional [fictional]

exTed foafage 25

exTed foafage lsquolsquo25rsquorsquoexTed foafgender lsquolsquomalersquorsquoexTed foafgender lsquolsquoMalersquorsquoenexTed foafbirthday lsquolsquo25-05rsquorsquo^^xsdgMonthDayexTed foafbirthday lsquolsquo25-05rsquorsquo

With respect to these consistency checking rules we considerthe terminological data to be sound38 We again only consider ter-minological axioms that are authoritatively served by their sourcefor example the following statement

sioc User owldisjointWith foaf Person

would have to be served by a document that either siocUser orfoafPerson dereferences to (either the FOAF or SIOC vocabularysince the axiom applies over a combination of FOAF and SIOC asser-tional data)

Given a grounding for such an inconsistency-detection rule wewish to analyse the constants bound by variables in join positionsto determine whether or not the contradiction is caused by consol-idation we are thus only interested in join variables that appear at

8 In any case we always source terminological data from the raw unconsolidatedrpus

least once in a consolidatable position (thus we do not supportdt-not-type) and where the join variable is lsquolsquointra-assertionalrsquorsquo(exists twice in the assertional patterns) Other forms of inconsis-tency must otherwise be present in the raw data

Further note that owlsameAs patternsmdashparticularly in rule eq-diff1mdashare implicit in the consolidated data eg consider

Assertional [httpwwwwikierorgfoafrdf]

wikierwikier0 owldifferentFrom

eswc2006psergio-fernandezrsquo

where an inconsistency is implicitly given by the owlsameAs rela-tion that holds between the consolidated identifiers wikierwiki-errsquo and eswc2006psergio-fernandezrsquo In this example thereare two Semantic Web researchers respectively named lsquolsquoSergioFernaacutendezrsquorsquo39 and lsquolsquoSergio Fernaacutendez Anzuolarsquorsquo40 who both partici-pated in the ESWC 2006 conference and who were subsequentlyconflated in the lsquolsquoDogFoodrsquorsquo export41 The former Sergio subsequentlyadded a counter-claim in his FOAF file asserting the above owldif-ferentFrom statement

Other inconsistencies do not involve explicit owlsameAs pat-terns a subset of which may require lsquolsquopositiversquorsquo reasoning to be de-tected eg

Terminological [httpxmlnscomfoafspec]

foafPerson owldisjointWith foafOrganization

foafknows rdfsdomain foafPerson

Assertional [httpidenticaw3cfoaf]

identica484040 foafknows identica45563

Assertional [inferred by prp-dom]

identica484040 a foafPerson

Assertional [httpwwwdatasemanticweborg

organizationw3crdf]

semweborgw3c0 a foafOrganization

where the two entities are initially consolidated due to sharing thevalue httpwwww3org for the inverse-functional propertyfoafhomepage the W3C is stated to be a foafOrganization

in one document and is inferred to be a person from its identicaprofile through rule prp-dom finally the W3C is a member of twodisjoint classes forming an inconsistency detectable by rule cax-dw42

Once the inconsistencies caused by consolidation have beenidentified we need to perform a repair of the equivalence class in-volved In order to resolve inconsistencies we make three simpli-fying assumptions

(i) the steps involved in the consolidation can be rederived withknowledge of direct inlinks and outlinks of the consolidatedentity or reasoned knowledge derived from there

(ii) inconsistencies are caused by pairs of consolidatedidentifiers

(iii) we repair individual equivalence classes and do not considerthe case where repairing one such class may indirectlyrepair another

httpwwwdatasemanticweborgdumpsconferenceseswc-2006-completerdf2 Note that this also could be viewed as a counter-example for using inconsisten-es to recant consolidation where arguably the two entities are coreferent from aractical perspective even if lsquolsquoincompatiblersquorsquo from a symbolic perspective

3

Fe4

A4

4

cip

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 101

With respect to the first item our current implementation willbe performing a repair of the equivalence class based on knowl-edge of direct inlinks and outlinks available through a simplemerge-join as used in the previous section this thus precludesrepair of consolidation found through rule cls-maxqc2 whichalso requires knowledge about the class memberships of the out-linked node With respect to the second item we say that incon-sistencies are caused by pairs of identifiersmdashwhat we termincompatible identifiersmdashsuch that we do not consider inconsisten-cies caused with respect to a single identifier (inconsistencies notcaused by consolidation) and do not consider the case where thealignment of more than two identifiers are required to cause asingle inconsistency (not possible in our rules) where such a casewould again lead to a disjunction of repair strategies With re-spect to the third item it is possible to resolve a set of inconsis-tent equivalence classes by repairing one for example considerrules with multiple lsquolsquointra-assertionalrsquorsquo join-variables (prp-irpprp-asyp) that can have explanations involving multiple consoli-dated identifiers as follows

Terminological [fictional]

made owlpropertyDisjointWith maker

Assertional [fictional]

exAZ00 maker exentcons0

dblpAntoine_Zimmermann00 made dblpHoganZUPD150

where both equivalences together constitute an inconsistencyRepairing one equivalence class would repair the inconsistencydetected for both we give no special treatment to such a caseand resolve each equivalence class independently In any casewe find no such incidences in our corpus these inconsistenciesrequire (i) axioms new in OWL 2 (rules prp-irp prp-asyp prp-pdw and prp-adp) (ii) alignment of two consolidated sets ofidentifiers in the subjectobject positions Note that such casescan also occur given the recursive nature of our consolidationwhereby consolidating one set of identifiers may lead to align-ments in the join positions of the consolidation rules in the nextiteration however we did not encounter such recursion duringthe consolidation phase for our data

The high-level approach to repairing inconsistent consolidationis as follows

(i) rederive and build a non-transitive symmetric graph ofequivalences between the identifiers in the equivalenceclass based on the inlinks and outlinks of the consolidatedentity

(ii) discover identifiers that together cause inconsistency andmust be separated generating a new seed equivalence classfor each and breaking the direct links between them

(iii) assign the remaining identifiers into one of the seed equiva-lence classes based on

43 We note the possibility of a dual correspondence between our lsquolsquobottom-uprsquoapproach to repair and the lsquolsquotop-downrsquorsquo minimal hitting set techniques introduced byReiter [55]

(a) minimum distance in the non-transitive equivalenceclass(b) if tied use a concurrence score

Given the simplifying assumptions we can formalise the prob-lem thus we denote the graph of non-transitive equivalences fora given equivalence class as a weighted graph G = (VEx) suchthat V B [ U is the set of vertices E B [ U B [ U is theset of edges and x E N R is a weighting function for theedges Our edge weights are pairs (dc) where d is the numberof sets of input triples in the corpus that allow to directly derivethe given equivalence relation by means of a direct owlsameAsassertion (in either direction) or a shared inverse-functional ob-ject or functional subjectmdashloosely the independent evidences for

the relation given by the input graph excluding transitiveowlsameAs semantics c is the concurrence score derivable be-tween the unconsolidated entities and is used to resolve ties(we would expect many strongly connected equivalence graphswhere eg the entire equivalence class is given by a singleshared value for a given inverse-functional property and we thusrequire the additional granularity of concurrence for repairing thedata in a non-trivial manner) We define a total lexicographicalorder over these pairs

Given an equivalence class Eq U [ B that we perceive to causea novel inconsistencymdashie an inconsistency derivable by the align-ment of incompatible identifiersmdashwe first derive a collection ofsets C = C1 Cn C 2U[B such that each Ci 2 C Ci Eq de-notes an unordered pair of incompatible identifiers

We then apply a straightforward greedy consistent clusteringof the equivalence class loosely following the notion of a mini-mal cutting (see eg [63]) For Eq we create an initial set of sin-gleton sets Eq0 each containing an individual identifier in theequivalence class Now let U(EiEj) denote the aggregated weightof the edge considering the merge of the nodes of Ei and thenodes of Ej in the graph the pair (dc) such that d denotes theunique evidences for equivalence relations between all nodes inEi and all nodes in Ej and such that c denotes the concurrencescore considering the merge of entities in Ei and Ejmdashintuitivelythe same weight as before but applied as if the identifiers in Ei

and Ej were consolidated in the graph We can apply the follow-ing clustering

ndash for each pair of sets EiEj 2 Eqn such that 9= ab 2 Ca 2 Eqib 2Eqj (ie consistently mergeable subsets) identify the weightsof U(EiEj) and order the pairings

ndash in descending order with respect to the above weights mergeEiEj pairsmdashsuch that neither Ei or Ej have already been mergedin this iterationmdashproducing En+1 at iterationrsquos end

ndash iterate over n until fixpoint

Thus we determine the pairs of incompatible identifiers thatmust necessarily be in different repaired equivalence classesdeconstruct the equivalence class and then begin reconstructingthe repaired equivalence class by iteratively merging the moststrongly linked intermediary equivalence classes that will not con-tain incompatible identifers43

72 Implementing disambiguation

The implementation of the above disambiguation process canbe viewed on two levels the macro level which identifies and col-lates the information about individual equivalence classes andtheir respectively consolidated inlinksoutlinks and the micro le-vel which repairs individual equivalence classes

On the macro level the task assumes input data sorted byboth subject (spocs0o0) and object (opsco0s0) again such thatso represent canonical identifiers and s0 o0 represent the originalidentifiers as before Note that we also require the assertedowlsameAs relations encoded likewise Given that all the re-quired information about the equivalence classes (their inlinksoutlinks derivable equivalences and original identifiers) are gath-ered under the canonical identifiers we can apply a straight-for-ward merge-join on s-o over the sorted stream of data and batchconsolidated segments of data

On a micro level we buffer each individual consolidated seg-ment into an in-memory index currently these segments fit in

rsquo

102 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

memory where for the largest equivalence classes we note thatinlinksoutlinks are commonly duplicatedmdashif this were not thecase one could consider using an on-disk index which shouldbe feasible given that only small batches of the corpus are underanalysis at each given time We assume access to the relevantterminological knowledge required for reasoning and the predi-cate-level statistics derived during from the concurrence analysisWe apply scan-reasoning and inconsistency detection over eachbatch and for efficiency skip over batches that are not sympto-mised by incompatible identifiers

For equivalence classes containing incompatible identifiers wefirst determine the full set of such pairs through application ofthe inconsistency detection rules usually each detection givesa single pair where we ignore pairs containing the same identi-fier (ie detections that would equally apply over the unconsol-idated data) We check the pairs for a trivial solution if allidentifiers in the equivalence class appear in some pair we checkwhether the graph formed by the pairs is strongly connected inwhich case the equivalence class must necessarily be completelydisbanded

For non-trivial repairs we extract the explicit owlsameAs rela-tions (which we view as directionless) and reinfer owlsameAs

relations from the consolidation rules encoding the subsequentgraph We label edges in the graph with a set of hashes denotingthe input triples required for their derivation such that the cardi-nality of the hashset corresponds to the primary edge weight Wesubsequently use a priority-queue to order the edge-weights andonly materialise concurrence scores in the case of a tie Nodes inthe equivalence graph are merged by combining unique edgesand merging the hashsets for overlapping edges Using these oper-ations we can apply the aformentioned process to derive the finalrepaired equivalence classes

In the final step we encode the repaired equivalence classesin memory and perform a final scan of the corpus (in naturalsorted order) revising identifiers according to their repairedcanonical term

73 Distributed implementation

Distribution of the task becomes straight-forward assumingthat the slave machines have knowledge of terminological datapredicate-level statistics and already have the consolidationencoding sextuples sorted and coordinated by hash on s and oNote that all of these data are present on the slave machines fromprevious tasks for the concurrence analysis we in fact maintainsextuples during the data preparation phase (although not re-quired by the analysis)

Thus we are left with two steps

Table 17Breakdown of timing of distributed disambiguation and repair

Category 1 2

Min Min

MasterTotal execution time 1147 1000 613 10Executing

Miscellaneous Idle (waiting for slaves) 1147 1000 612 10

SlaveAvg Executing (total exc idle) 1147 1000 590 9

Identify inconsistencies and repairs 747 651 385 6Repair Corpus 400 348 205 3

Avg Idle 23Waiting for peers 00 00 22Waiting for master

ndash Run each slave machine performs the above process on its seg-ment of the corpus applying a merge-join over the data sortedby (spocs0o0) and (opsco0s0) to derive batches of consoli-dated data which are subsequently analysed diagnosed anda repair derived in memory

ndash Gatherrun the master machine gathers all repair informationfrom all slave machines and floods the merged repairs to theslave machines the slave machines subsequently perform thefinal repair of the corpus

74 Performance evaluation

We ran the inconsistency detection and disambiguation overthe corpora produced by the extended consolidation approachFor the full corpus the total time taken was 235 h The inconsis-tency extraction and equivalence class repair analysis took 12 hwith a significant average idle time of 75 min (93) in particularcertain large batches of consolidated data took significant amountsof time to process particularly to reason over The additional ex-pense is due to the relaxation of duplicate detection we cannotconsider duplicates on a triple level but must consider uniquenessbased on the entire sextuple to derive the information required forrepair we must apply many duplicate inferencing steps On theother hand we can skip certain reasoning paths that cannot leadto inconsistency Repairing the corpus took 098 h with an averageidle time of 27 min

In Table 17 we again give a detailed breakdown of the timingsfor the task With regards the full corpus note that the aggregationof the repair information took a negligible amount of time andwhere only a total of one minute is spent on the slave machineMost notably load-balancing is somewhat of an issue causingslave machines to be idle for on average 72 of the total tasktime mostly waiting for peers This percentagemdashand the associatedload-balancing issuesmdashwould likely be aggrevated further by moremachines or a higher scale of data

With regards the timings of tasks when varying the number ofslave machines as the number of slave machines doubles the totalexecution times decrease by factors of 0534 0599 and0649 respectively Unlike for previous tasks where the aggrega-tion of global knowledge on the master machine poses a bottle-neck this time load-balancing for the inconsistency detectionand repair selection task sees overall performance start to convergewhen increasing machines

75 Results evaluation

It seems that our discussion of inconsistency repair has beensomewhat academic from the total of 282 million consolidated

4 8 Full-8

min Min Min

00 367 1000 238 1000 1411 1000 01 01 01 01 00 366 999 238 999 1411 1000

63 336 915 223 937 1309 92828 232 634 172 722 724 51335 103 281 51 215 585 41537 31 85 15 63 102 7237 31 84 15 62 101 72 01 01

Table 20Results of manual repair inspection

Manual Auto Count

SAME frasl 223 443DIFFERENT frasl 261 519UNCLEAR ndash 19 38frasl SAME 361 746frasl DIFFERENT 123 254SAME SAME 205 919SAME DIFFERENT 18 81DIFFERENT DIFFERENT 105 402DIFFERENT SAME 156 597

Table 18Breakdown of inconsistency detections for functional-properties where properties inthe same row gave identical detections

Functional property Detections

1 foafgender 602 foafage 323 dboheight dbolength 44 dbowheelbase dbowidth 35 atomowlbody 16 locaddress locname locphone 1

Table 19Breakdown of inconsistency detections for disjoint-classes

Disjoint Class 1 Disjoint Class 2 Detections

1 foafDocument foafPerson 1632 foafDocument foafAgent 1533 foafOrganization foafPerson 174 foafPerson foafProject 35 foafOrganization foafProject 1

1

10

100

1000

1 10 100

num

ber o

f rep

aire

d eq

uiva

lenc

e cl

asse

s

number of partitions

Fig 10 Distribution of the number of partitions the inconsistent equivalenceclasses are broken into (loglog)

1

10

100

1000

1 10 100 1000

num

ber o

f par

titio

ns

number of IDs in partition

Fig 11 Distribution of the number of identifiers per repaired partition (loglog)

44 We were particularly careful to distinguish information resourcesmdashsuch asWIKIPEDIA articlesmdashand non-information resources as being DIFFERENT

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 103

batches to check we found 280 equivalence classes (001)mdashcon-taining 3985 identifiersmdashcausing novel inconsistency Of these280 inconsistent sets 185 were detected through disjoint-classconstraints 94 were detected through distinct literal values for in-verse-functional properties and one was detected through

owldifferentFrom assertions We list the top five functional-properties given non-distinct literal values in Table 18 and thetop five disjoint classes in Table 19 note that some inconsistentequivalent classes had multiple detections where eg the classfoafPerson is a subclass of foafAgent and thus an identicaldetection is given for each Notably almost all of the detections in-volve FOAF classes or properties (Further we note that betweenthe time of the crawl and the time of writing the FOAF vocabularyhas removed disjointness constraints between the foafDocu-

ment and foafPersonfoafAgent classes)In terms of repairing these 280 equivalence classes 29 (104)

had to be completely disbanded since all identifiers were pairwiseincompatible A total of 905 partitions were created during the re-pair with each of the 280 classes being broken up into an averageof 323 repaired consistent partitions Fig 10 gives the distributionof the number of repaired partitions created for each equivalenceclass 230 classes were broken into the minimum of two partitionsand one class was broken into 96 partitions Fig 11 gives the dis-tribution of the number of identfiers in each repaired partitioneach partition contained an average of 44 equivalent identifierswith 577 identifiers in singleton partitions and one partition con-taining 182 identifiers

Finally although the repaired partitions no longer cause incon-sistency this does not necessarily mean that they are correct Fromthe raw equivalence classes identified to cause inconsistency weapplied the same sampling technique as before we extracted503 identifiers and for each randomly sampled an identifier orig-inally thought to be equivalent We then manually checkedwhether or not we considered the identifiers to refer to the sameentity or not In the (blind) manual evaluation we selected optionUNCLEAR 19 times option SAME 232 times (of which 49 were TRIVIALLY

SAME) and option DIFFERENT 249 times44 We checked our manuallyannotated results against the partitions produced by our repair tosee if they corresponded The results are enumerated in Table 20Of the 481 (clear) manually-inspected pairs 361 (721) remainedequivalent after the repair and 123 (254) were separated Of the223 pairs manually annotated as SAME 205 (919) also remainedequivalent in the repair whereas 18 (181) were deemed differenthere we see that the repair has a 919 recall when reesatablishingequivalence for resources that are the same Of the 261 pairs verifiedas DIFFERENT 105 (402) were also deemed different in the repairwhereas 156 (597) were deemed to be equivalent

Overall for identifying which resources were the same andshould be re-aligned during the repair the precision was 0568with a recall of 0919 and F1-measure of 0702 Conversely theprecision for identifying which resources were different and shouldbe kept separate during the repair the precision was 0854 with arecall of 0402 and F1-measure of 0546

Interpreting and inspecting the results we found that the

104 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

inconsistency-based repair process works well for correctly fixingequivalence classes with one or two lsquolsquobad applesrsquorsquo However forequivalence classes which contain many broken resources wefound that the repair process tends towards re-aligning too manyidentifiers although the output partitions are consistent theyare not correct For example we found that 30 of the pairs whichwere annotated as different but were re-consolidated by the repairwere from the operacom domain where users had specified var-ious nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by ourblack-list) This lead to groups of users (the largest containing205 users) being initially identified as being co-referent througha web of sharing different such values In these groups inconsis-tency was caused by different values for gender (a functional prop-erty) and so they were passed into the repair process Howeverthe repair process simply split the groups into male and femalepartitions which although now consistent were still incorrectThis illustrates a core weakness of the proposed repair processconsistency does not imply correctness

In summary for an inconsistency-based detection and repairprocess to work well over Linked Data vocabulary publisherswould need to provide more sensible axioms which indicate wheninstance data are inconsistent These can then be used to detectmore examples of incorrect equivalence (amongst other use-cases[9]) To help repair such cases in particular the widespread useand adoption of datatype-properties which are both inverse-func-tional and functional (ie one-to-one properties such as isbn)would greatly help to automatically identify not only which re-sources are the same but to identify which resources are differentin a granular manner45 Functional properties (eg gender) or dis-joint classes (eg Person Document) which have wide catchmentsare useful to detect inconsistencies in the first place but are not idealfor performing granular repairs

8 Critical discussion

In this section we provide critical discussion of our approachfollowing the dimensions of the requirements listed at the outset

With respect to scale on a high level our primary means oforganising the bulk of the corpus is external-sorts characterisedby the linearithmic time complexity O(nfrasllog(n)) external-sortsdo not have a critical main-memory requirement and are effi-ciently distributable Our primary means of accessing the data isvia linear scans With respect to the individual tasks

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples (a minimal sketch of this placement scheme is given after this list); although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.
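As referenced in the list above, the following is a minimal, illustrative sketch of the modulo-hash placement of triples over slave machines; the function and constant names, and the choice of CRC32 as the hash, are our own assumptions and not the system's actual code.

import zlib

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NUM_SLAVES = 8  # hypothetical machine count

def placement(s, p, o):
    # A triple is sent to the machine(s) obtained by hashing its subject and,
    # unless it is an rdf:type triple, its object; class terms are thus never
    # used for placement, which avoids skew on very frequent class URIs.
    targets = {zlib.crc32(s.encode("utf-8")) % NUM_SLAVES}
    if p != RDF_TYPE:
        targets.add(zlib.crc32(o.encode("utf-8")) % NUM_SLAVES)
    return targets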

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets), we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such consideration, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/

49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system (called ObjectCoRef) for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mash-ups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

51 From personal communication with the Sindice team.

52 http://sig.ma/search?q=paris

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
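To make the locality-sensitive hashing idea concrete, the following is a minimal MinHash-with-banding sketch for grouping resources by their virtual documents. It is only an illustration of the general technique under assumed inputs (a map from resource URIs to non-empty token sets), not RDFSim's actual implementation, which uses its own index and applies the Jaccard coefficient to the resulting candidates.

import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes):
    # One MinHash value per seeded hash function: the minimum hash over all tokens.
    return [min(int(hashlib.md5(f"{seed}:{t}".encode("utf-8")).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def candidate_groups(virtual_docs, bands=5, rows=4):
    # virtual_docs: {resource URI -> non-empty set of tokens from surrounding text}.
    # Resources whose signatures collide in any band become candidates for the
    # more expensive pair-wise similarity measures.
    buckets = defaultdict(set)
    for uri, tokens in virtual_docs.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(uri)
    return [group for group in buckets.values() if len(group) > 1]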

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store: the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/

54 http://sameas.org/

55 http://ws.nju.edu.cn/objectcoref/

56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, Factforge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in:


Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR abs/1005.4298 (2010).

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

• Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora
  • 1. Introduction
  • 2. Preliminaries
    • 2.1. RDF
    • 2.2. RDFS, OWL and owl:sameAs
    • 2.3. Inference rules
    • 2.4. T-split rules
    • 2.5. Authoritative T-split rules
    • 2.6. OWL 2 RL/RDF rules
    • 2.7. Distribution architecture
    • 2.8. Experimental setup
  • 3. Experimental corpus
  • 4. Base-line consolidation
    • 4.1. High-level approach
    • 4.2. Distributed approach
    • 4.3. Performance evaluation
    • 4.4. Results evaluation
  • 5. Extended reasoning consolidation
    • 5.1. High-level approach
    • 5.2. Distributed approach
    • 5.3. Performance evaluation
    • 5.4. Results evaluation
  • 6. Statistical concurrence analysis
    • 6.1. High-level approach
      • 6.1.1. Quantifying concurrence
      • 6.1.2. Implementing entity-concurrence analysis
    • 6.2. Distributed implementation
    • 6.3. Performance evaluation
    • 6.4. Results evaluation
  • 7. Entity disambiguation and repair
    • 7.1. High-level approach
    • 7.2. Implementing disambiguation
    • 7.3. Distributed implementation
    • 7.4. Performance evaluation
    • 7.5. Results evaluation
  • 8. Critical discussion
  • 9. Related work
    • 9.1. Related fields
    • 9.2. Instance matching
    • 9.3. Domain-specific consolidation
    • 9.4. Large-scale Web-data consolidation
    • 9.5. Large-scale/distributed consolidation
    • 9.6. Candidate reduction
    • 9.7. Naming/linking resources on the Web
    • 9.8. Analyses of owl:sameAs
  • 10. Conclusion
  • Acknowledgements
  • Appendix A. Prefixes
  • References
Page 9: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora


non-rdf:type triples, URIs occurred in, on average, 3.996 documents and 0.727 PLDs.

The URI with the most data-level occurrences (1.66 million) was http://identi.ca; the URI with the most re-use across documents (appearing in 179.3 thousand documents) was http://creativecommons.org/licenses/by/3.0/; the URI with the most re-use across PLDs (appearing in 80 different domains) was http://www.ldodds.com/foaf/foaf-a-matic. Although some URIs do enjoy widespread re-use across different documents and domains, in Figs. 1 and 2 we give the distribution of re-use of URIs across documents and across PLDs, where a power-law relationship is roughly evident; again, the majority of URIs only appear in one document (61%) or in one PLD (99%).

From this analysis, we can conclude that, with respect to data-level terms in our corpus:

– blank-nodes, which by their very nature cannot be re-used across documents, are 1.8 times more prevalent than URIs;

– despite a smaller number of unique URIs, each one is used in (probably coincidentally) 1.8 times more triples;

– unlike blank-nodes, URIs commonly only appear in either a subject position or an object position;

[Fig. 1. Distribution of URIs and the number of documents they appear in (in a data-position); x-axis: number of documents mentioning URI, y-axis: number of URIs (log/log scale).]

[Fig. 2. Distribution of URIs and the number of PLDs they appear in (in a data-position); x-axis: number of PLDs mentioning URI, y-axis: number of URIs (log/log scale).]

– each URI is re-used on average in 4.7 documents, but usually only within the same domain (most external re-use is in the object position of a triple);

– 99% of URIs appear in only one PLD.

We can conclude that within our corpus (itself a general crawl for RDF/XML on the Web) there is only sparse re-use of data-level terms across sources, and particularly across domains.

Finally, for the purposes of demonstrating performance across varying numbers of machines, we extract a smaller corpus comprising 100 million quadruples from the full evaluation corpus. We extract the sub-corpus from the head of the raw corpus: since the data are ordered by access time, polling statements from the head roughly emulates a smaller crawl of data, which should ensure, e.g., that all well-linked vocabularies are contained therein.

4. Base-line consolidation

We now present the "base-line" algorithm for consolidation, which consumes asserted owl:sameAs relations in the data. Linked Data best practices encourage the provision of owl:sameAs links between exporters that coin different URIs for the same entities:

"It is common practice to use the owl:sameAs property for stating that another data source also provides information about a specific non-information resource" [7].

We would thus expect there to be a significant amount of explicit owl:sameAs data present in our corpus (provided directly by the Linked Data publishers themselves) that can be directly used for consolidation.

To perform consolidation over such data, the first step is to extract all explicit owl:sameAs data from the corpus. We must then compute the transitive and symmetric closure of the equivalence relation and build equivalence classes: sets of coreferent identifiers. Note that the set of equivalence classes forms a partition of coreferent identifiers, where, due to the transitive and symmetric semantics of owl:sameAs, each identifier can only appear in one such class. Also note that we do not need to consider singleton equivalence classes that contain only one identifier: consolidation need not perform any action if an identifier is found only to be coreferent with itself. Once the closure of asserted owl:sameAs data has been computed, we then need to build an index that maps identifiers to the equivalence class in which they are contained. This index enables lookups of coreferent identifiers. Next, in the consolidated data, we would like to collapse the data mentioning identifiers in each equivalence class to instead use a single, consistent, canonical identifier for each equivalence class; we must thus choose a canonical identifier. Finally, we can scan the entire corpus and, using the equivalence class index, rewrite the original identifiers to their canonical form. As an optional step, we can also add links between each canonical identifier and its coreferent forms using an artificial owl:sameAs relation in the output, thus persisting the original identifiers.

The distributed algorithm presented in this section is based on an in-memory equivalence-class closure and index, and is the current method of consolidation employed by the SWSE system described previously in [32]. Herein, we briefly reintroduce the approach from [32], where we also add new performance evaluation over varying numbers of machines in the distributed setup, present more discussion and analysis of the use of owl:sameAs in our Linked Data corpus, and manually evaluate the precision of the approach for an extended sample of one thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;

(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;

(iii) for each equivalence class in the index, choose a canonical term;

(iv) scan the corpus again, canonicalising any term in the subject position or object position of a non-rdf:type triple.

The non-trivial aspects of the algorithm are given by theequality closure and index To perform the in-memory transi-tive symmetric closure we use a traditional unionndashfind algo-rithm [6640] for computing equivalence partitions where (i)equivalent elements are stored in common sets such that eachelement is only contained in one set (ii) a map provides a find

function for looking up which set an element belongs to and(iii) when new equivalent elements are found the sets they be-long to are unioned The process is based on an in-memory mapindex and is detailed in Algorithm 1 where

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;

(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);

(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing the term with the lowest alphabetical ordering; a canonical term is thus associated with each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
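As an illustration of this canonicalisation step, the following minimal sketch (ours, not the actual SWSE code) chooses a canonical term per equivalence class, preferring URIs over blank-nodes and breaking ties by lexical order, and then rewrites subjects and the objects of non-rdf:type triples during the second scan.

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def canonical_term(eq_class):
    # Prefer URIs over blank-nodes (assumed to be serialised with a leading "_:"),
    # then take the lexically lowest term as canonical.
    uris = [t for t in eq_class if not t.startswith("_:")]
    return min(uris) if uris else min(eq_class)

def canonicalise(quads, eq_index):
    # eq_index maps an identifier to its equivalence class (e.g., a frozenset);
    # identifiers without coreferent forms are simply passed through unchanged.
    for s, p, o, ctx in quads:
        s_can = canonical_term(eq_index[s]) if s in eq_index else s
        o_can = o if (p == RDF_TYPE or o not in eq_index) else canonical_term(eq_index[o])
        yield (s_can, p, o_can, ctx)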

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations.

(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index.

(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods (involving the two scans of the main corpus) can be conducted in parallel.

Algorithm 1. Building equivalence map [32].

Require: SAMEAS DATA: SA
1:  map ← {}
2:  for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
3:    eqcs ← map.get(s)
4:    eqco ← map.get(o)
5:    if eqcs = ∅ ∧ eqco = ∅ then
6:      eqcs∪o ← {s, o}
7:      map.put(s, eqcs∪o)
8:      map.put(o, eqcs∪o)
9:    else if eqcs = ∅ then
10:     add s to eqco
11:     map.put(s, eqco)
12:   else if eqco = ∅ then
13:     add o to eqcs
14:     map.put(o, eqcs)
15:   else if eqcs ≠ eqco then
16:     if |eqcs| > |eqco| then
17:       add all eqco into eqcs
18:       for eo ∈ eqco do
19:         map.put(eo, eqcs)
20:       end for
21:     else
22:       add all eqcs into eqco
23:       for es ∈ eqcs do
24:         map.put(es, eqco)
25:       end for
26:     end if
27:   end if
28: end for
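The closure computed by Algorithm 1 can equivalently be realised with a standard union–find structure; the following is a minimal, single-machine sketch under that reading (class and method names are ours), using path compression and, as in Algorithm 1, merging the smaller class into the larger.

class EquivalenceIndex:
    def __init__(self):
        self.parent = {}  # identifier -> parent identifier (its own name when it is the representative)
        self.size = {}    # representative -> size of its equivalence class

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # merge the smaller class into the larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

# Usage over the extracted owl:sameAs statements (hypothetical variable names):
#   idx = EquivalenceIndex()
#   for s, o in sameas_pairs:
#       if s != o:
#           idx.union(s, o)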

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan, extracting owl:sameAs statements, took 12.5 min, with an average idle time for the servers of 11 s (1.4%); i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1, 2, 4, 8) over the 100 million quadruple corpus.

Table 5
Breakdown of timing of distributed baseline consolidation.

Category                          1 slave          2 slaves         4 slaves         8 slaves         Full-8
                                  min      %       min      %       min      %       min      %       min      %
Master
  Total execution time            42.2   100.0     22.3   100.0     11.9   100.0      6.7   100.0     63.3   100.0
  Executing                        0.5     1.1      0.5     2.2      0.5     4.0      0.5     7.5      8.5    13.4
    Aggregating owl:sameAs         0.4     1.1      0.5     2.2      0.5     4.0      0.5     7.5      8.4    13.3
    Miscellaneous                  0.1     0.1      0.1     0.4       *       *        *       *       0.1     0.1
  Idle (waiting for slaves)       41.8    98.9     21.8    97.7     11.4    95.8      6.2    92.1     54.8    86.6
Slaves
  Avg. executing (total)          41.8    98.9     21.8    97.7     11.4    95.8      6.2    92.1     53.5    84.6
    Extract owl:sameAs             9.0    21.4      4.5    20.2      2.2    18.4      1.1    16.8     12.3    19.5
    Consolidate                   32.7    77.5     17.1    76.9      8.8    73.5      4.7    70.5     41.2    65.1
  Avg. idle                        0.5     1.1      0.7     2.9      1.0     8.2      0.8    12.7      9.8    15.4
    Waiting for peers              0.0     0.0      0.1     0.7      0.5     4.0      0.3     4.7      1.3     2.0
    Waiting for master             0.5     1.1      0.5     2.3      0.5     4.2      0.5     7.9      8.5    13.4

[Fig. 3. Distribution of sizes of equivalence classes on log/log scale (from [32]); x-axis: equivalence class size, y-axis: number of classes.]

[Fig. 4. Distribution of the number of PLDs per equivalence class on log/log scale; x-axis: number of PLDs in equivalence class, y-axis: number of classes.]

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-2877d02973d0d01e8f29c7113776e7e, which at the time of writing correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


Note that in the table we use the symbol ~ to indicate a negligible value (0 < ~ < 0.05). We see that as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines is doubled: we observe a 0.528, 0.536 and 0.561 reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5 would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs), an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities, and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for classes 1 and 2 were deemed incorrect, due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp., Dieter Fensel and Rudi Studer: authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words, (ii) were variations of abbreviations or valid synonyms, (iii) were different language versions (e.g., dbpedia:Etats_Unis), (iv) were nicknames (e.g., dbpedia:Yankee_land), (v) were related but not equivalent (e.g., dbpedia:American_Civilization), (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results.

Table 6
Largest five equivalence classes (from [32]).

    Canonical term (lexically first in equivalence class)                                 Size    Correct
1   http://bio2rdf.org/dailymed_drugs:1000                                                8481    ✗
2   http://bio2rdf.org/dailymed_drugs:1042                                                 800    ✗
3   http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e            443    ✓
4   http://agame2teach.com/#ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1                 353    ✓
5   http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9       316    ✓


For the sampling, we take the closed equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are, in fact, equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%) and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type-object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

     PLD               PLD                     Co-occur
 1   rdfize.com        uriburner.com           506623
 2   dbpedia.org       freebase.com            187054
 3   bio2rdf.org       purl.org                185392
 4   loc.gov           info:lc/authorities a   166650
 5   l3s.de            rkbexplorer.com         125842
 6   bibsonomy.org     l3s.de                  125832
 7   bibsonomy.org     rkbexplorer.com         125830
 8   dbpedia.org       mpii.de                  99827
 9   freebase.com      mpii.de                  89610
10   dbpedia.org       umbel.org                65565

a In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities, in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features, including functional properties, inverse-functional properties and certain cardinality restrictions, that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes, such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc., to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, WIKIPEDIA articles, etc.
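To sketch the idea with a hypothetical example (the ex: property and the publisher prefixes are our own invention, and the ISBN is a commonly used dummy value), two publishers could identify the same book through a shared literal value rather than a shared URI:

ex:isbn rdf:type owl:InverseFunctionalProperty .
pubA:book123 ex:isbn "978-3-16-148410-0" .
pubB:item987 ex:isbn "978-3-16-148410-0" .

from which rule prp-ifp (discussed below) would allow inferring pubA:book123 owl:sameAs pubB:item987, without either publisher knowing of the other's naming scheme.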

Although Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive/symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in-memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked Data corpora.

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data; hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;
(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

In Step (i), we extract terminological data, required for application of our rules, from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate a T-ground rule

?x homepage ?y  ⇒  ?x isPrimaryTopicOf ?y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

?x isPrimaryTopicOf ?y

which may indicate data useful for consolidation (note that here, isPrimaryTopicOf is assumed to be inverse-functional).
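As a rough illustration of Steps (ii) and (iii) (our own simplified Python sketch, not the system's code; the property names follow the example above), terminological statements are compiled into single-pattern rewrite rules and into a set of consolidation-relevant predicates, which a single scan then applies to each assertional triple:

# terminological (T-Box) statements, as extracted in Step (i)
tbox = [
    ("homepage", "rdfs:subPropertyOf", "isPrimaryTopicOf"),
    ("isPrimaryTopicOf", "rdf:type", "owl:InverseFunctionalProperty"),
]

# Step (ii): T-ground general rules with one assertional pattern,
# here represented simply as body-predicate -> head-predicate rewrites
tground_rules = [(b, h) for (b, r, h) in tbox if r == "rdfs:subPropertyOf"]

# Step (ii): assertional patterns useful for consolidation
# (memberships of inverse-functional properties)
ifps = {s for (s, r, o) in tbox
        if r == "rdf:type" and o == "owl:InverseFunctionalProperty"}

def scan(triple):
    """Single-scan treatment of one assertional triple (Step (iii)):
    returns the inferred triples and the statements (input or inferred)
    that are buffered for the later consolidation join."""
    s, p, o = triple
    inferred = [(s, head, o) for (body, head) in tground_rules if p == body]
    relevant = [t for t in [triple] + inferred if t[1] in ifps]
    return inferred, relevant

print(scan(("ex:Axel", "homepage", "<http://polleres.net/>")))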

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern, including any owl:sameAs statements found, are buffered to a file for later join computation.

Subsequently, in Step (iv), we must now compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].
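To make the second and third primitives concrete, the following small Python sketch (ours; batch size, file handling and line format are arbitrary simplifications) performs an external sort over an iterator of lines, and a merge-join over two inputs already sorted on the join key:

import heapq, itertools, tempfile

def external_sort(lines, key, batch=100000):
    """Sort fixed-size batches in memory, spill each sorted batch to a
    temporary file, then lazily merge the sorted runs."""
    runs = []
    lines = iter(lines)
    while True:
        chunk = sorted(itertools.islice(lines, batch), key=key)
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(chunk)
        run.seek(0)
        runs.append(run)
    return heapq.merge(*runs, key=key)

def merge_join(left, right, key):
    """Scan two iterators sorted by `key` in an interleaving manner,
    yielding the aligned groups for an in-memory join per join element."""
    lgroups, rgroups = itertools.groupby(left, key), itertools.groupby(right, key)
    lk, lv = next(lgroups, (None, None))
    rk, rv = next(rgroups, (None, None))
    while lv is not None and rv is not None:
        if lk == rk:
            yield lk, list(lv), list(rv)
            lk, lv = next(lgroups, (None, None))
            rk, rv = next(rgroups, (None, None))
        elif lk < rk:
            lk, lv = next(lgroups, (None, None))
        else:
            rk, rv = next(rgroups, (None, None))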

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.

18 These files are GZip-compressed flat files of N-Triple-like syntax, encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.
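For instance, a minimal in-memory stand-in (ours, not the on-disk implementation) for the inverse-functional-property case: membership statements of inverse-functional properties, pre-sorted on the object position, are grouped by (object, predicate), and all subjects sharing a group are linked to one of them by owl:sameAs.

import itertools

def ifp_same_as(statements):
    """statements: (s, p, o) memberships of inverse-functional properties,
    sorted by (o, p) so that a single linear scan aligns the join position.
    Yields raw owl:sameAs pairs for subjects sharing a (p, o) pair."""
    for _key, group in itertools.groupby(statements, key=lambda t: (t[2], t[1])):
        subjects = sorted({s for (s, _p, _o) in group})
        for other in subjects[1:]:
            # link every coreferent subject to the lexically lowest one
            yield (other, "owl:sameAs", subjects[0])

stmts = sorted([
    ("dblp:Axel", "isPrimaryTopicOf", "<http://polleres.net/>"),
    ("axel:me", "isPrimaryTopicOf", "<http://polleres.net/>"),
], key=lambda t: (t[2], t[1]))
print(list(ifp_same_as(stmts)))   # [('dblp:Axel', 'owl:sameAs', 'axel:me')]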

Both the originally asserted and the newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method again, leveraging external sorts and merge-joins to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, and a, b, c, etc., to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:
  - in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
  - at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored; dropped non-canonical statements are removed;
  - the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
  - inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;

(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
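As a toy illustration (ours) of why canonicalisation keeps the materialised output linear rather than quadratic in the size of an equivalence class, the closed data can be reduced to one (non-canonical, canonical) pair per non-canonical member:

def canonical_index(equivalence_classes):
    """Reduce closed equivalence classes to (non-canonical, canonical)
    pairs, taking the lexically lowest member of each class as canonical
    (this toy version ignores that URIs should sort before blank nodes)."""
    index = []
    for eqc in equivalence_classes:
        members = sorted(eqc)
        index.extend((m, members[0]) for m in members[1:])
    return sorted(index)

# {a, b, c} yields two pairs rather than the six symmetric/transitive ones
print(canonical_index([{"c", "a", "b"}, {"e", "d"}]))
# [('b', 'a'), ('c', 'a'), ('e', 'd')]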

Next, we briefly describe the process of canonicalising the data with respect to this on-disk equality closure, where we again use external-sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite, and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20

20 One could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.
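The rewrite itself can be sketched in memory as follows (a simplification, ours; the on-disk version streams the same merge-scan over sorted files): sameas is the pruned index of (non-canonical, canonical) pairs sorted by the first element, and each pass merge-scans the data sorted on the position being rewritten.

def rewrite(quads, sameas, pos):
    """Merge-scan `quads` (sorted on position `pos`) against `sameas`
    ((old, canonical) pairs sorted on old), rewriting that position."""
    index = iter(sameas)
    old, canon = next(index, (None, None))
    for quad in quads:
        key = quad[pos]
        while old is not None and old < key:
            old, canon = next(index, (None, None))
        if old == key:
            quad = quad[:pos] + (canon,) + quad[pos + 1:]
        yield quad

def consolidate(quads, sameas):
    """Two-pass consolidation: rewrite subjects, re-sort by object,
    then rewrite the objects of non-rdf:type statements."""
    pass1 = list(rewrite(sorted(quads, key=lambda q: q[0]), sameas, 0))
    by_obj = sorted(pass1, key=lambda q: q[2])
    rewritable = [q for q in by_obj if q[1] != "rdf:type"]
    kept = [q for q in by_obj if q[1] == "rdf:type"]
    return list(rewrite(rewritable, sameas, 2)) + kept

sameas = [("ex:B", "ex:A")]            # ex:B rewritten to canonical ex:A
quads = [("ex:B", "foaf:name", '"b"'), ("ex:C", "foaf:knows", "ex:B")]
print(consolidate(quads, sameas))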

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net/> .

axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie/> .

<http://polleres.net/> owl:sameAs <http://axel.deri.ie/> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs data, we require a higher-level iterative process as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;

(ii) apply the initial closure of the owl:sameAs data;

(iii) then, iteratively until no new owl:sameAs inferences are found:
  - rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
  - derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;
  - re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;

(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;

(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;

(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;

(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:
  - Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;
  - Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);

(vi) Flood: send the closed owl:sameAs data to the slave machines (once the distributed sort has been completed);

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.


Table 8
Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently.

Category                                      1                2                4                8                Full-8
                                              min     %        min     %        min     %        min     %        min     %
Master
  Total execution time                        454.3   100.0    230.2   100.0    127.5   100.0     81.3   100.0    740.4   100.0
  Executing                                    29.9     6.6     30.3    13.1     32.5    25.5     30.1    37.0    243.6    32.9
    Aggregate consolidation-relevant data       4.3     0.9      4.5     2.0      4.7     3.7      4.6     5.7     29.9     4.0
    Closing owl:sameAs *                       24.4     5.4     24.5    10.6     25.4    19.9     24.4    30.0    211.2    28.5
    Miscellaneous                               1.2     0.3      1.3     0.5      2.4     1.9      1.1     1.3      2.5     0.3
  Idle (waiting for slaves)                   424.5    93.4    199.9    86.9     95.0    74.5     51.2    63.0    496.8    67.1
Slave
  Avg. executing (total)                      448.9    98.8    221.2    96.1    111.6    87.5     56.5    69.4    599.1    80.9
    Extract terminology                        44.4     9.8     22.7     9.9     11.8     9.2      6.9     8.4     49.4     6.7
    Extract consolidation-relevant data        95.5    21.0     48.3    21.0     24.3    19.1     13.2    16.2    137.9    18.6
    Initial sort (by subject) *                98.0    21.6     48.3    21.0     23.7    18.6     11.6    14.2    133.8    18.1
    Consolidation                             211.0    46.4    101.9    44.3     51.9    40.7     24.9    30.6    278.0    37.5
  Avg. idle                                     5.5     1.2      8.9     3.9     37.9    29.7     23.6    29.0    141.3    19.1
    Waiting for peers                           0.0     0.0      3.2     1.4      7.1     5.6      6.3     7.7     37.5     5.1
    Waiting for master                          5.5     1.2      5.8     2.5     30.8    24.1     17.3    21.2    103.8    14.0




5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible; the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time; of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment, we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements, this time directly extracted from explicit statements as opposed to explicit and inferred statements, which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.


Table 9
Top ten most frequently occurring blacklisted values.

     Blacklisted term                                  Occurrences
 1   Empty literals                                    584735
 2   <http://null>                                     414088
 3   <http://www.vox.com/gone>                         150402
 4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"         58338
 5   <http://www.facebook.com>                           6988
 6   <http://facebook.com>                               5462
 7   <http://www.google.com>                             2234
 8   <http://www.facebook.com>                           1254
 9   <http://google.com>                                 1108
10   <http://null.com>                                    542

Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log). [Figure omitted; x-axis: equivalence class size, y-axis: number of classes; series: baseline consolidation, reasoning consolidation.]


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties, and 109 cardinality restrictions with a value of one. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus, we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties, and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters, and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
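For illustration, a trivial filter over consolidation-relevant statements (our own sketch, seeded with a few of the values from Table 9; the full blacklist has forty-one entries):

BLACKLIST = {
    "",                                           # empty literals
    "http://null",
    "http://www.vox.com/gone",
    "08445a31a78661b5c746feff39a9db6e4e2cc5cf",   # SHA-1 of "mailto:"
}

def keep(statement):
    """Drop (inverse-)functional-property memberships whose value is a
    known noisy identifier before any owl:sameAs is derived from them."""
    s, p, o = statement
    return s not in BLACKLIST and o not in BLACKLIST

stmts = [("ex:u1", "foaf:mbox_sha1sum", "08445a31a78661b5c746feff39a9db6e4e2cc5cf"),
         ("ex:u2", "foaf:homepage", "http://example.org/u2")]
print([st for st in stmts if keep(st)])           # only ex:u2's statement survives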

During the computation of the owl:sameAs closure, we found zero inferences through the cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation), mentioning a total of 14.86 million terms (an increase of 2.58× from baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 2173× from baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size two drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user, labelled Team Vox, commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford, labelled IBRG, where again it is identified in thousands of documents using different blank-nodes but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME 823 times (82.3%), TRIVIALLY SAME 145 (14.5%), DIFFERENT 23 (2.3%) and UNCLEAR 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences.

Table 10
Largest five equivalence classes after extended consolidation.

    Canonical term (lexically lowest in equivalence class)              Size     Correct
1   bnode37@http://a12iggymom.vox.com/profile/foaf.rdf                  33052    ✓
2   http://bio2rdf.org/dailymed_drugs:1000                               8481    ✗
3   http://ajft.org/rdf/foaf.rdf#_me                                     8140    ✓
4   bnode4@http://174.129.12.140:8080/tcm/data/association/100           4395    ✓
5   bnode1@http://aaa977.vox.com/profile/foaf.rdf                        1977    ✓

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log). [Figure omitted; x-axis: number of PLDs in equivalence class, y-axis: number of classes; series: baseline consolidation, reasoning consolidation.]

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

    Canonical identifier                               BL    R
1   <http://dblp.l3s.de/Tim_Berners-Lee>               26    50
2   <genid:danbri>                                     10    138
3   <http://update.status.net/>                         0    0
4   <http://www.ldodds.com/foaf/foaf-a-matic>           0    0
5   <http://update.status.net/user/1#acct>              0    6

Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated, and reasoning consolidated data (log/log). [Figure omitted; x-axis: number of PLDs mentioning a blank-node/URI term, y-axis: number of terms; series: raw data, baseline consolidated, reasoning consolidated.]


This is due in part to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data, which "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%).

25 We make these and later results available for review at http://swse.deri.org/entity.

The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

26 Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that:



(i) the vast majority of identifiers (about 99%) still only appear in one PLD, and (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities, with higher concurrence scores, are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

- investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;
- formulated an initial small-scale (5.6 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or coreference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic, and domain-agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities; i.e., a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs; e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers); we wish to dampen the cumulative effect of such correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .

ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .

ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .

ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between the three colleagues ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p w.r.t. s in G, denoted Card_G(p, s), is the cardinality of the set {o ∈ C | (s, p, o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p w.r.t. o in G, denoted ICard_G(p, o), is the cardinality of the set {s ∈ U ∪ B | (s, p, o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting Gex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph; e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

  AIC_G(p) = |{(s, o) | (s, p, o) ∈ G}| / |{o | ∃s : (s, p, o) ∈ G}|

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is, by definition, greater-than or equal-to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and, analogously, properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus, for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given such sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula, as follows, to weight properties with few observations towards a global mean.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

  AAIC_G(p) = ( AIC_G(p) · |G_p| + \overline{AIC}_G · \overline{|G|} ) / ( |G_p| + \overline{|G|} )        (1)

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), \overline{AIC}_G is the average inverse-cardinality for all predicate–object pairs (formally, \overline{AIC}_G = |G| / |{(p, o) | ∃s : (s, p, o) ∈ G}|), and \overline{|G|} is the average number of distinct objects for all predicates in the graph (formally, \overline{|G|} = |{(p, o) | ∃s : (s, p, o) ∈ G}| / |{p | ∃s, ∃o : (s, p, o) ∈ G}|).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that AIC_G(p) · |G_p| = |{(s, o) | (s, p, o) ∈ G}| also denotes the number of triples for which p appears as a predicate in graph G, and that \overline{AIC}_G · \overline{|G|} = |G| / |{p | ∃s, ∃o : (s, p, o) ∈ G}| also denotes the average number of triples per predicate.

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]). [Figure omitted; nodes include swperson:stefan-decker, swperson:andreas-harth, swpaper:140, swpaper:2221, swpaper:3403, sworg:nui-galway, sworg:deri-nui-galway and dbpedia:Ireland; edge labels include foaf:made, dc:creator, swrc:author, foaf:maker, swrc:affiliation, foaf:member and foaf:based_near.]


We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (\overline{AIC}_G) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (\overline{|G|}).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company (e.g., http://google.com/) should not gain as much concurrence as entities that share a value referential to a smaller company (e.g., http://deri.ie/). Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence coefficient of a predicate–subject pair (p, s) with respect to a graph G is given as:

  C_G(p, s) = 1 / ( Card_G(p, s) · AAC_G(p) )

and the concurrence coefficient of a predicate–object pair (p, o) with respect to a graph G is analogously given as:

  IC_G(p, o) = 1 / ( ICard_G(p, o) · AAIC_G(p) )

Again please note that these coefficients fall into the interval]01] since the denominator by definition is necessarily greaterthan one

To take an example let pwh = foafworkplaceHomepage andsay that we compute AAICG(pwh)=7 from a large number of obser-

28 Here we try to distinguish between propertyndashvalue pairs that are exclusive inreality (ie on the level of whatrsquos signified) and those that are exclusive in the givengraph Admittedly one could think of counter-examples where not including thegeneral statistics of the property may yield a better indication of weightedconcurrence particularly for generic properties that can be applied in many contextsfor example consider the exclusive predicatendashobject pair (skos subject cate-gory Koenigsegg_vehicles) given for a non-exclusive property

vations indicating that each workplace homepage in the graph G islinked to by on average seven employees Further letog = httpgooglecom and assume that og occurs 2000 timesas a value for pwh ICardG(pwhog) = 2000 now ICGethpwh ogTHORN frac14

120007 frac14 000007 Also let od = httpderiie such thatICardG(pwhod)=100 now ICGethpwh odTHORN frac14 1

107 000143 Heresharing DERI as a workplace will indicate a higher level of concur-rence than analogously sharing Google

Finally we require some means of aggregating the coefficientsof the set of pairs that two entities share to derive the final concur-rence measure

Definition 6 (Aggregated concurrence score) Let Z = (z1 zn) be atuple such that for each i = 1 n zi2]01] The aggregatedconcurrence value ACSn is computed iteratively starting withACS0 = 0 then for each k = 1 n ACSk = zk + ACSk1 zkfraslACSk1

The computation of the ACS value is the same process as deter-mining the probability of two independent events occurringmdashP(A _ B) = P(A) + P(B) P(AfraslB)mdashwhich is by definition commuta-tive and associative and thus computation is independent of theorder of the elements in the tuple It may be more accurate to viewthe coefficients as fuzzy values and the aggregation function as adisjunctive combination in some extensions of fuzzy logic [75]

However the underlying coefficients may not be derived fromstrictly independent phenomena there may indeed be correlationbetween the propertyndashvalue pairs that two entities share To illus-trate we reintroduce a relevant example from [34] shown in Fig 8where we see two researchers that have co-authored many paperstogether have the same affiliation and are based in the samecountry

This example illustrates three categories of concurrencecorrelation

(i) same-value correlation where two entities may be linked to thesame value by multiple predicates in either direction (egfoaf made dccreator swrcauthor foafmaker)

(ii) intra-property correlation where two entities that share agiven propertyndashvalue pair are likely to share further valuesfor the same property (eg co-authors sharing one valuefor foafmade are more likely to share further values)

(iii) inter-property correlation where two entities sharing a givenpropertyndashvalue pair are likely to share further distinct butrelated propertyndashvalue pairs (eg having the same valuefor swrcaffiliation and foafbased_near)

Ideally we would like to reflect such correlation in the compu-tation of the concurrence between the two entities

Regarding same-value correlation for a value with multipleedges shared between two entities we choose the shared predicate

96 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

edge with the lowest AA[I]C value and disregard the other edgesie we only consider the most exclusive property used by bothentities to link to the given value and prune the other edges

Regarding intra-property correlation we apply a lower-levelaggregation for each predicate in the set of shared predicate-valuepairs Instead of aggregating a single tuple of coefficients we gen-erate a bag of tuples Z frac14 fZp1 Zpn

g where each element Zpi

represents the tuple of (non-pruned) coefficients generated forthe predicate pi29 We then aggregate this bag as follows

ACSethZTHORN frac14 ACSethACSethZpi AAfrac12ICethpiTHORNTHORNZpi

2ZTHORN

where AA[I]C is either AAC or AAIC dependant on the directionalityof the predicate-value pair observed Thus the total contributionpossible through a given predicate (eg foafmade) has anupper-bound set as its AA[I]C value where each successive sharedvalue for that predicate (eg each successive co-authored paper)contributes positively (but increasingly less) to the overall concur-rence measure

Detecting and counteracting inter-property correlation is per-haps more difficult and we leave this as an open question

612 Implementing entity-concurrence analysisWe aim to implement the above methods using sorts and scans

and we wish to avoid any form of complex indexing or pair-wisecomparison First we wish to extract the statistics relating to the(inverse-)cardinalities of the predicates in the data Given that thereare 23 thousand unique predicates found in the input corpus weassume that we can fit the list of predicates and their associatedstatistics in memorymdashif such were not the case one could consideran on-disk map where we would expect a high cache hit-rate basedon the distribution of property occurrences in the data (cf [32])

Moving forward we can calculate the necessary predicate-levelstatistics by first sorting the data according to natural order(spoc) and then scanning the data computing the cardinality(number of distinct objects) for each (sp) pair and maintainingthe average cardinality for each p found For inverse-cardinalityscores we apply the same process sorting instead by (opsc) or-der counting the number of distinct subjects for each (po) pairand maintaining the average inverse-cardinality scores for eachp After each scan the statistics of the properties are adjustedaccording to the credibility formula in Eq (4)

We then apply a second scan of the sorted corpus first we scanthe data sorted in natural order and for each (sp) pair for each setof unique objects Ops found thereon and for each pair in

fethoi ojTHORN 2 U [ B U [ B j oi oj 2 Ops oi lt ojg

where lt denotes lexicographical order we output the following sex-tuple to an on-disk file

ethoi ojCethp sTHORNp sTHORN

where Cethp sTHORN frac14 1jOps jAACethpTHORN We apply the same process for the other

direction for each (po) pair for each set of unique subjects Spo andfor each pair in

fethsi sjTHORN 2 U [ B U [ B j si sj 2 Spo si lt sjg

we output analogous sextuples of the form

ethsi sj ICethp oTHORN p othornTHORN

We call the sets Ops and their analogues Spo concurrence classesdenoting sets of entities that share the given predicatendashsubjectpredicatendashobject pair respectively Here note that the lsquo + rsquo andlsquo rsquo elements simply mark and track the directionality from whichthe tuple was generated required for the final aggregation of the

29 For brevity we omit the graph subscript

co-efficient scores Similarly we do not immediately materialisethe symmetric concurrence scores where we instead do so at theend so as to forego duplication of intermediary processing

Once generated we can sort the two files of tuples by their nat-ural order and perform a merge-join on the first two elementsmdashgeneralising the directional oisi to simply ei each (eiej) pair de-notes two entities that share some predicate-value pairs in com-mon where we can scan the sorted data and aggregate the finalconcurrence measure for each (eiej) pair using the information en-coded in the respective tuples We can thus generate (triviallysorted) tuples of the form (eiejs) where s denotes the final aggre-gated concurrence score computed for the two entities optionallywe can also write the symmetric concurrence tuples (ejeis) whichcan be sorted separately as required

Note that the number of tuples generated is quadratic with re-spect to the size of the respective concurrence class which be-comes a major impediment for scalability given the presence oflarge such setsmdashfor example consider a corpus containing 1 mil-lion persons sharing the value female for the property foaf

gender where we would have to generate 1062106

2 500 billionnon-reflexive non-symmetric concurrence tuples However wecan leverage the fact that such sets can only invoke a minor influ-ence on the final concurrence of their elements given that themagnitude of the setmdasheg jSpojmdashis a factor in the denominator ofthe computed C(po) score such that Cethp oTHORN 1

jSop j Thus in practice

we implement a maximum-size threshold for the Spo and Ops con-currence classes this threshold is selected based on a practicalupper limit for raw similarity tuples to be generated where theappropriate maximum class size can trivially be determined along-side the derivation of the predicate statistics For the purpose ofevaluation we choose to keep the number of raw tuples generatedat around 1 billion and so set the maximum concurrence classsize at 38mdashwe will see more in Section 64

62 Distributed implementation

Given the previous discussion our distributed implementationis fairly straight-forward as follows

(i) Coordinate the slave machines split their segment of thecorpus according to a modulo-hash function on the subjectposition of the data sort the segments and send the splitsegments to the peer determined by the hash-function theslaves simultaneously gather incoming sorted segmentsand subsequently perform a merge-sort of the segments

(ii) Coordinate the slave machines apply the same operationthis time hashing on objectmdashtriples with rdf type as pred-icate are not included in the object-hashing subsequentlythe slaves merge-sort the segments ordered by object

(iii) Run the slave machines then extract predicate-level statis-tics and statistics relating to the concurrence-class-size dis-tribution which are used to decide upon the class sizethreshold

(iv) Gatherfloodrun the master machine gathers and aggre-gates the high-level statistics generated by the slavemachines in the previous step and sends a copy of the globalstatistics back to each machine the slaves subsequentlygenerate the raw concurrence-encoding sextuples asdescribed before from a scan of the data in both orders

(v) Coordinate the slave machines coordinate the locally gen-erated sextuples according to the first element (join posi-tion) as before

(vi) Run the slave machines aggregate the sextuples coordi-nated in the previous step and produce the final non-sym-metric concurrence tuples

Table 12Breakdown of timing of distributed concurrence analysis

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 5162 1000 2627 1000 1378 1000 730 1000 8354 1000Executing 22 04 23 09 31 23 34 46 21 03

Miscellaneous 22 04 23 09 31 23 34 46 21 03Idle (waiting for slaves) 5140 996 2604 991 1346 977 696 954 8333 997

SlaveAvg executing (total exc idle) 5140 996 2578 981 1311 951 664 910 7866 942

Splitsortscatter (subject) 1196 232 593 226 299 217 149 204 1759 211Merge-sort (subject) 421 82 207 79 112 81 57 79 624 75

Splitsortscatter (object) 1154 224 565 215 288 209 143 196 1640 196Merge-sort (object) 372 72 187 71 96 70 51 70 556 66Extract high-level statistics 208 40 101 38 51 37 27 37 284 33Generate raw concurrence tuples 454 88 233 89 116 84 59 80 677 81Cooordinatesort concurrence tuples 976 189 501 191 250 181 125 172 1660 199Merge-sortaggregate similarity 360 70 192 73 100 72 53 72 666 80

Avg idle 22 04 49 19 67 49 65 90 488 58Waiting for peers 00 00 26 10 36 26 32 43 467 56Waiting for master 22 04 23 09 31 23 34 46 21 03

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 97

(vii) Run the slave machines produce the symmetric version ofthe concurrence tuples and coordinate and sort on the firstelement

Here we make heavy use of the coordinate function to aligndata according to the join position required for the subsequentprocessing stepmdashin particular aligning the raw data by subjectand object and then the concurrence tuples analogously

Note that we do not hash on the object position of rdf typetriples our raw corpus contains 2068 million such triples and gi-ven the distribution of class memberships we assume that hashingthese values will lead to uneven distribution of data and subse-quently uneven load balancingmdasheg 792 of all class member-ships are for foafPerson hashing on which would send 1637million triples to one machine which alone is greater than theaverage number of triples we would expect per machine (1398million) In any case given that our corpus contains 105 thousandunique values for rdf type we would expect the average-in-verse-cardinality to be approximately 1970mdasheven for classes withtwo members the potential effect on concurrence is negligible

63 Performance evaluation

We apply our concurrence analysis over the consolidated cor-pus derived in Section 5 The total time taken was 139 h Sortingsplitting and scattering the data according to subject on the slavemachines took 306 h with an average idle time of 77 min(42) Subsequently merge-sorting the sorted segments took113 h with an average idle time of 54 min (8) Analogously sort-ing splitting and scattering the non-rdftype statements by ob-ject took 293 h with an average idle time of 118 min (67)Merge sorting the data by object took 099 h with an average idletime of 38 min (63) Extracting the predicate statistics andthreshold information from data sorted in both orders took29 min with an average idle time of 06 min (21) Generatingthe raw unsorted similarity tuples took 698 min with an averageidle time of 21 min (3) Sorting and coordinating the raw similar-ity tuples across the machines took 1801 min with an average idletime of 141 min (78) Aggregating the final similarity took678 min with an average idle time of 12 min (18)

Table 12 presents a breakdown of the timing of the task Firstwith regards application over the full corpus although this task re-quires some aggregation of global-knowledge by the master

machine the volume of data involved is minimal a total of 21minutes is spent on the master machine performing various minortasks (initialisation remote calls logging aggregation and broad-cast of statistics) Thus 997 of the task is performed in parallelon the slave machine Although there is less time spent waitingfor the master machine compared to the previous two tasks deriv-ing the concurrence measures involves three expensive sortcoor-dinatemerge-sort operations to redistribute and sort the data overthe slave swarm The slave machines were idle for on average 58of the total task time most of this idle time (996) was spentwaiting for peers The master machine was idle for almost the en-tire task with 997 waiting for the slave machines to finish theirtasksmdashagain interleaving a job for another task would have practi-cal benefits

Second with regards the timing of tasks when varying the num-ber of slave machines for 100 million quadruples we see that thepercentage of time spent idle on the slave machines increases from04 (1) to 9 (8) However for each incremental doubling of thenumber of slave machines the total overall task times are reducedby 0509 0525 and 0517 respectively

64 Results evaluation

With respect to data distribution after hashing on subject weobserved an average absolute deviation (average distance fromthe mean) of 176 thousand triples across the slave machines rep-resenting an average 013 deviation from the mean near-optimaldata distribution After hashing on the object of non-rdftype tri-ples we observed an average absolute deviation of 129 million tri-ples across the machines representing an average 11 deviationfrom the mean in particular we note that one machine was as-signed 37 million triples above the mean (an additional 33 abovethe mean) Although not optimal the percentage of data deviationgiven by hashing on object is still within the natural variation inrun-times we have seen for the slave machines during most paral-lel tasks

In Fig 9a and b we illustrate the effect of including increasinglylarge concurrence classes on the number of raw concurrence tuplesgenerated For the predicatendashobject pairs we observe a powerndashlaw(-esque) relationship between the size of the concurrence classand the number of such classes observed Second we observe thatthe number of concurrences generated for each increasing classsize initially remains fairly staticmdashie larger class sizes give

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(a) Predicate-object pairs

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(b) Predicate-subject pairs

Fig 9 Breakdown of potential concurrence classes given with respect to predicatendashobject pairs and predicatendashsubject pairs respectively where for each class size on the x-axis (si) we show the number of classes (ci) the raw non-reflexive non-symmetric concurrence tuples generated (spi frac14 ci

s2ithornsi

2 ) a cumulative count of tuples generated forincreasing class sizes (cspi frac14

Pj6ispj) and our cut-off (si = 38) that we have chosen to keep the total number of tuples at 1 billion (loglog)

Table 13Top five predicates with respect to lowest adjusted average cardinality (AAC)

Predicate Objects Triples AAC

1 foafnick 150433035 150437864 10002 lldpubmedjournal 6790426 6795285 10033 rdfvalue 2802998 2803021 10054 eurostatgeo 2642034 2642034 10055 eurostattime 2641532 2641532 1005

Table 14Top five predicates with respect to lowest adjusted average inverse-cardinality(AAIC)

Predicate Subjects Triples AAIC

1 lldpubmedmeshHeading 2121384 2121384 10092 opiumfieldrecommendation 1920992 1920992 10103 fbtypeobjectkey 1108187 1108187 10174 foafpage 1702704 1712970 10175 skipinionshasFeature 1010604 1010604 1019

98 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

quadratically more concurrences but occur polynomially less of-tenmdashuntil the point where the largest classes that generally onlyhave one occurrence is reached and the number of concurrencesbegins to increase quadratically Also shown is the cumulativecount of concurrence tuples generated for increasing class sizeswhere we initially see a powerndashlaw(-esque) correlation whichsubsequently begins to flatten as the larger concurrence classes be-come more sparse (although more massive)

For the predicatendashsubject pairs the same roughly holds truealthough we see fewer of the very largest concurrence classesthe largest concurrence class given by a predicatendashsubject pairwas 79 thousand versus 19 million for the largest predicatendashob-ject pair respectively given by the pairs (kwamap macsman-ual_rameau_lcsh) and (opiumfieldrating ) Also weobserve some lsquolsquonoisersquorsquo where for milestone concurrence class sizes(esp at 50 100 1000 2000) we observe an unusual amount ofclasses For example there were 72 thousand concurrence classesof precisely size 1000 (versus 88 concurrence classes at size996)mdashthe 1000 limit was due to a FOAF exporter from the hi5comdomain which seemingly enforces that limit on the total lsquolsquofriendscountrsquorsquo of users translating into many users with precisely 1000values for foafknows30 Also for example there were 55 thousandclasses of size 2000 (versus 6 classes of size 1999)mdashalmost all ofthese were due to an exporter from the bio2rdforg domain whichputs this limit on values for the b2rlinkedToFrom property31 Wealso encountered unusually large numbers of classes approximatingthese milestones such as 73 at 2001 Such phenomena explain thestaggered lsquolsquospikesrsquorsquoand lsquolsquodiscontinuitiesrsquorsquo in Fig 9b which can be ob-served to correlate with such milestone values (in fact similar butless noticeable spikes are also present in Fig 9a)

These figures allow us to choose a threshold of concurrence-class size given an upper bound on raw concurrence tuples togenerate For the purposes of evaluation we choose to keep thenumber of materialised concurrence tuples at around 1 billionwhich limits our maximum concurrence class size to 38 (fromwhich we produce 1024 billion tuples 721 million through shared(po) pairs and 303 million through (ps) pairs)

With respect to the statistics of predicates for the predicatendashsubject pairs each predicate had an average of 25229 unique ob-jects for 37953 total triples giving an average cardinality of15 We give the five predicates observed to have the lowest ad-

30 cf httpapihi5comrestprofilefoaf10061469731 cf httpbio2rdforgmeshD000123Q000235

justed average cardinality in Table 13 These predicates are judgedby the algorithm to be the most selective for identifying their sub-ject with a given object value (ie quasi-inverse-functional) notethat the bottom two predicates will not generate any concurrencescores since they are perfectly unique to a given object (ie thenumber of triples equals the number of objects such that no twoentities can share a value for this predicate) For the predicatendashob-ject pairs there was an average of 11572 subjects for 20532 tri-ples giving an average inverse-cardinality of 264 Weanalogously give the five predicates observed to have the lowestadjusted average inverse cardinality in Table 14 These predicatesare judged to be the most selective for identifying their object witha given subject value (ie quasi-functional) however four of thesepredicates will not generate any concurrences since they are per-fectly unique to a given subject (ie those where the number of tri-ples equals the number of subjects)

Aggregation produced a final total of 6369 million weightedconcurrence pairs with a mean concurrence weight of 00159Of these pairs 195 million involved a pair of identifiers from dif-ferent PLDs (31) whereas 6174 million involved identifiers fromthe same PLD however the average concurrence value for an in-tra-PLD pair was 0446 versus 0002 for inter-PLD pairsmdashalthough

Table 15Top five concurrent entities and the number of pairs they share

Entity label 1 Entity label 2 Concur

1 New York City New York State 7912 London England 8943 Tokyo Japan 9004 Toronto Ontario 4185 Philadelphia Pennsylvania 217

Table 16Breakdown of concurrences for top five ranked entities in SWSE ordered by rankwith respectively entity label number of concurrent entities found the label of theconcurrent entity with the largest degree and finally the degree value

Ranked entity Con lsquolsquoClosestrsquorsquo entity Val

1 Tim Berners-Lee 908 Lalana Kagal 0832 Dan Brickley 2552 Libby Miller 0943 updatestatusnet 11 socialnetworkro 0454 FOAF-a-matic 21 foafme 0235 Evan Prodromou 3367 Stav Prodromou 089

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 99

fewer intra-PLD concurrences are found they typically have higherconcurrences32

In Table 15 we give the labels of top five most concurrent enti-ties including the number of pairs they sharemdashthe concurrencescore for each of these pairs was gt09999999 We note that theyare all locations where particularly on WIKIPEDIA (and thus filteringthrough to DBpedia) properties with location values are typicallyduplicated (eg dbpdeathPlace dbpbirthPlace dbphead-quartersmdashproperties that are quasi-functional) for exampleNew York City and New York State are both the dbpdeath-

Place of dbpediaIsacc_Asimov etcIn Table 16 we give a description of the concurrent entities

found for the top-five ranked entitiesmdashfor brevity again we showentity labels In particular we note that a large amount of concur-rent entities are identified for the highly-ranked persons With re-spect to the strongest concurrences (i) Tim and his former studentLalana share twelve primarily academic links coauthoring six pa-pers (ii) Dan and Libby co-founders of the FOAF project share87 links primarily 73 foafknows relations to and from the samepeople as well as a co-authored paper occupying the same profes-sional positions etc33 (iii) updatestatusnet and socialnet-

workro share a single foafaccountServiceHomepage linkfrom a common user (iv) similarly the FOAF-a-matic and foafme

services share a single mvcbgeneratorAgent inlink (v) finallyEvan and Stav share 69 foafknows inlinks and outlinks exportedfrom the identica service

34 We instead refer the interested reader to [9] for some previous works on the topic

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data and thus this is anapproximative reading for the purposes of illustration we extract the PLDs fromcanonical identifiers which are choosen based on arbitrary lexical ordering

33 Notably Leigh Dodds (creator of the FOAF-a-matic service) is linked by theproperty quaffing drankBeerWith to both

of general inconsistency repair for Linked Data35 We use syntactic shortcuts in our file to denote when s = s0 andor o = o0

Maintaining the additional rewrite information during the consolidation process istrivial where the output of consolidating subjects gives quintuples (spocs0) whichare then sorted and consolidated by o to produce the given sextuples

36 httpmodelsokkamorgENS-core-vocabularycountry_of_residence37 httpontologydesignpatternsorgcpowlfsdas

RDF data representing formal contradictions according to OWLsemantics Where such contradictions are created due to consoli-dation we believe this to be a good indicator of erroneousconsolidation

Once erroneous equivalences have been detected we wouldlike to subsequently diagnose and repair the coreferences involvedOne option would be to completely disband the entire equivalenceclass however there may often be only one problematic identifierthat eg causes inconsistency in a large equivalence classmdashbreak-ing up all equivalences would be a coarse solution Instead hereinwe propose a more fine-grained method for repairing equivalenceclasses that incrementally rebuilds coreferences in the set in amanner that preserves consistency and that is based on the originalevidences for the equivalences found as well as concurrence scores(discussed in the previous section) to indicate how similar the ori-ginal entities are Once the erroneous equivalence classes havebeen repaired we can then revise the consolidated corpus to reflectthe changes

71 High-level approach

The high-level approach is to see if the consolidation of anyentities conducted in Section 5 lead to any novel inconsistenciesand subsequently recant the equivalences involved thus it isimportant to note that our aim is not to repair inconsistenciesin the data but instead to repair incorrect consolidation sympto-mised by inconsistency34 In this section we (i) describe whatforms of inconsistency we detect and how we detect them (ii)characterise how inconsistencies can be caused by consolidationusing examples from our corpus where possible (iii) discuss therepair of equivalence classes that have been determined to causeinconsistency

First in order to track which data are consolidated and whichnot in the previous consolidation step we output sextuples ofthe form

ethsp o c s0 o0THORN

where s p o c denote the consolidated quadruple containingcanonical identifiers in the subjectobject position as appropri-ate and s0 and o0 denote the input identifiers prior toconsolidation35

To detect inconsistencies in the consolidated corpus we use theOWL 2 RLRDF rules with the false consequent [24] as listed inTable 4 A quick check of our corpus revealed that one documentprovides eight owlAsymmetricProperty and 10 owlIrre-

flexiveProperty axioms36 and one directory gives 9 owlAll-

DisjointClasses axioms37 and where we found no other OWL2 axioms relevant to the rules in Table 4

We also consider an additional rule whose semantics are indi-rectly axiomatised by the OWL 2 RLRDF rules (through prp-fp dt-diff and eq-diff1) but which we must support directly since we donot consider consolidation of literals

p a owlFunctionalProperty

x p l1 l2

l1 owldifferentFrom l2

)false

100 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

where we underline the terminological pattern To illustrate wetake an example from our corpus

Terminological [httpdbpediaorgdata3

lengthrdf]

dpolength rdftype owlFunctionalProperty

Assertional [httpdbpediaorgdata

Fiat_Nuova_500xml]dbpediaFiat_Nuova_5000 dpolength3546

^^xsddouble

Assertional [httpdbpediaorgdata

Fiat_500xml]dbpediaFiat_Nuova0 dpolength 297

^^xsddouble

where we use the prime symbol [0] to denote identifiers consideredcoreferent by consolidation Here we see two very closely relatedmodels of cars consolidated in the previous step but we now iden-tify that they have two different values for dpolengthmdasha func-tional-propertymdashand thus the consolidation raises an inconsistency

Note that we do not expect the owldifferentFrom assertionto be materialised but instead intend a rather more relaxed seman-tics based on a heurisitic comparison given two (distinct) literalbindings for l1 and l2 we flag an inconsistency iff (i) the datavalues of the two bindings are not equal (standard OWL semantics)and (ii) their lower-case string value (minus language-tags and data-types) are lexically unequal In particular the relaxation is inspiredby the definition of the FOAF (datatype) functional propertiesfoafage foafgender and foafbirthday where the rangeof these properties is rather loosely defined a generic range ofrdfsLiteral is formally defined for these properties with infor-mal recommendations to use malefemale as gender values andMM-DD syntax for birthdays but not giving recommendations fordatatype or language-tags The latter relaxation means that wewould not flag an inconsistency in the following data

3

co

Terminological [httpxmlnscomfoafspecindexrdf]

9 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e f rn=aacute=ndezSergiohtml0 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e a nzuolaSergio_Fern=aacute=ndezhtml1

foafPerson owldisjointWith foafDocument

Assertional [fictional]

exTed foafage 25

exTed foafage lsquolsquo25rsquorsquoexTed foafgender lsquolsquomalersquorsquoexTed foafgender lsquolsquoMalersquorsquoenexTed foafbirthday lsquolsquo25-05rsquorsquo^^xsdgMonthDayexTed foafbirthday lsquolsquo25-05rsquorsquo

With respect to these consistency checking rules we considerthe terminological data to be sound38 We again only consider ter-minological axioms that are authoritatively served by their sourcefor example the following statement

sioc User owldisjointWith foaf Person

would have to be served by a document that either siocUser orfoafPerson dereferences to (either the FOAF or SIOC vocabularysince the axiom applies over a combination of FOAF and SIOC asser-tional data)

Given a grounding for such an inconsistency-detection rule wewish to analyse the constants bound by variables in join positionsto determine whether or not the contradiction is caused by consol-idation we are thus only interested in join variables that appear at

8 In any case we always source terminological data from the raw unconsolidatedrpus

least once in a consolidatable position (thus we do not supportdt-not-type) and where the join variable is lsquolsquointra-assertionalrsquorsquo(exists twice in the assertional patterns) Other forms of inconsis-tency must otherwise be present in the raw data

Further note that owlsameAs patternsmdashparticularly in rule eq-diff1mdashare implicit in the consolidated data eg consider

Assertional [httpwwwwikierorgfoafrdf]

wikierwikier0 owldifferentFrom

eswc2006psergio-fernandezrsquo

where an inconsistency is implicitly given by the owlsameAs rela-tion that holds between the consolidated identifiers wikierwiki-errsquo and eswc2006psergio-fernandezrsquo In this example thereare two Semantic Web researchers respectively named lsquolsquoSergioFernaacutendezrsquorsquo39 and lsquolsquoSergio Fernaacutendez Anzuolarsquorsquo40 who both partici-pated in the ESWC 2006 conference and who were subsequentlyconflated in the lsquolsquoDogFoodrsquorsquo export41 The former Sergio subsequentlyadded a counter-claim in his FOAF file asserting the above owldif-ferentFrom statement

Other inconsistencies do not involve explicit owlsameAs pat-terns a subset of which may require lsquolsquopositiversquorsquo reasoning to be de-tected eg

Terminological [httpxmlnscomfoafspec]

foafPerson owldisjointWith foafOrganization

foafknows rdfsdomain foafPerson

Assertional [httpidenticaw3cfoaf]

identica484040 foafknows identica45563

Assertional [inferred by prp-dom]

identica484040 a foafPerson

Assertional [httpwwwdatasemanticweborg

organizationw3crdf]

semweborgw3c0 a foafOrganization

where the two entities are initially consolidated due to sharing thevalue httpwwww3org for the inverse-functional propertyfoafhomepage the W3C is stated to be a foafOrganization

in one document and is inferred to be a person from its identicaprofile through rule prp-dom finally the W3C is a member of twodisjoint classes forming an inconsistency detectable by rule cax-dw42

Once the inconsistencies caused by consolidation have beenidentified we need to perform a repair of the equivalence class in-volved In order to resolve inconsistencies we make three simpli-fying assumptions

(i) the steps involved in the consolidation can be rederived withknowledge of direct inlinks and outlinks of the consolidatedentity or reasoned knowledge derived from there

(ii) inconsistencies are caused by pairs of consolidatedidentifiers

(iii) we repair individual equivalence classes and do not considerthe case where repairing one such class may indirectlyrepair another

httpwwwdatasemanticweborgdumpsconferenceseswc-2006-completerdf2 Note that this also could be viewed as a counter-example for using inconsisten-es to recant consolidation where arguably the two entities are coreferent from aractical perspective even if lsquolsquoincompatiblersquorsquo from a symbolic perspective

3

Fe4

A4

4

cip

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 101

With respect to the first item our current implementation willbe performing a repair of the equivalence class based on knowl-edge of direct inlinks and outlinks available through a simplemerge-join as used in the previous section this thus precludesrepair of consolidation found through rule cls-maxqc2 whichalso requires knowledge about the class memberships of the out-linked node With respect to the second item we say that incon-sistencies are caused by pairs of identifiersmdashwhat we termincompatible identifiersmdashsuch that we do not consider inconsisten-cies caused with respect to a single identifier (inconsistencies notcaused by consolidation) and do not consider the case where thealignment of more than two identifiers are required to cause asingle inconsistency (not possible in our rules) where such a casewould again lead to a disjunction of repair strategies With re-spect to the third item it is possible to resolve a set of inconsis-tent equivalence classes by repairing one for example considerrules with multiple lsquolsquointra-assertionalrsquorsquo join-variables (prp-irpprp-asyp) that can have explanations involving multiple consoli-dated identifiers as follows

Terminological [fictional]

made owlpropertyDisjointWith maker

Assertional [fictional]

exAZ00 maker exentcons0

dblpAntoine_Zimmermann00 made dblpHoganZUPD150

where both equivalences together constitute an inconsistencyRepairing one equivalence class would repair the inconsistencydetected for both we give no special treatment to such a caseand resolve each equivalence class independently In any casewe find no such incidences in our corpus these inconsistenciesrequire (i) axioms new in OWL 2 (rules prp-irp prp-asyp prp-pdw and prp-adp) (ii) alignment of two consolidated sets ofidentifiers in the subjectobject positions Note that such casescan also occur given the recursive nature of our consolidationwhereby consolidating one set of identifiers may lead to align-ments in the join positions of the consolidation rules in the nextiteration however we did not encounter such recursion duringthe consolidation phase for our data

The high-level approach to repairing inconsistent consolidationis as follows

(i) rederive and build a non-transitive symmetric graph ofequivalences between the identifiers in the equivalenceclass based on the inlinks and outlinks of the consolidatedentity

(ii) discover identifiers that together cause inconsistency andmust be separated generating a new seed equivalence classfor each and breaking the direct links between them

(iii) assign the remaining identifiers into one of the seed equiva-lence classes based on

43 We note the possibility of a dual correspondence between our lsquolsquobottom-uprsquoapproach to repair and the lsquolsquotop-downrsquorsquo minimal hitting set techniques introduced byReiter [55]

(a) minimum distance in the non-transitive equivalenceclass(b) if tied use a concurrence score

Given the simplifying assumptions we can formalise the prob-lem thus we denote the graph of non-transitive equivalences fora given equivalence class as a weighted graph G = (VEx) suchthat V B [ U is the set of vertices E B [ U B [ U is theset of edges and x E N R is a weighting function for theedges Our edge weights are pairs (dc) where d is the numberof sets of input triples in the corpus that allow to directly derivethe given equivalence relation by means of a direct owlsameAsassertion (in either direction) or a shared inverse-functional ob-ject or functional subjectmdashloosely the independent evidences for

the relation given by the input graph excluding transitiveowlsameAs semantics c is the concurrence score derivable be-tween the unconsolidated entities and is used to resolve ties(we would expect many strongly connected equivalence graphswhere eg the entire equivalence class is given by a singleshared value for a given inverse-functional property and we thusrequire the additional granularity of concurrence for repairing thedata in a non-trivial manner) We define a total lexicographicalorder over these pairs

Given an equivalence class Eq U [ B that we perceive to causea novel inconsistencymdashie an inconsistency derivable by the align-ment of incompatible identifiersmdashwe first derive a collection ofsets C = C1 Cn C 2U[B such that each Ci 2 C Ci Eq de-notes an unordered pair of incompatible identifiers

We then apply a straightforward greedy consistent clusteringof the equivalence class loosely following the notion of a mini-mal cutting (see eg [63]) For Eq we create an initial set of sin-gleton sets Eq0 each containing an individual identifier in theequivalence class Now let U(EiEj) denote the aggregated weightof the edge considering the merge of the nodes of Ei and thenodes of Ej in the graph the pair (dc) such that d denotes theunique evidences for equivalence relations between all nodes inEi and all nodes in Ej and such that c denotes the concurrencescore considering the merge of entities in Ei and Ejmdashintuitivelythe same weight as before but applied as if the identifiers in Ei

and Ej were consolidated in the graph We can apply the follow-ing clustering

ndash for each pair of sets EiEj 2 Eqn such that 9= ab 2 Ca 2 Eqib 2Eqj (ie consistently mergeable subsets) identify the weightsof U(EiEj) and order the pairings

ndash in descending order with respect to the above weights mergeEiEj pairsmdashsuch that neither Ei or Ej have already been mergedin this iterationmdashproducing En+1 at iterationrsquos end

ndash iterate over n until fixpoint

Thus we determine the pairs of incompatible identifiers thatmust necessarily be in different repaired equivalence classesdeconstruct the equivalence class and then begin reconstructingthe repaired equivalence class by iteratively merging the moststrongly linked intermediary equivalence classes that will not con-tain incompatible identifers43

72 Implementing disambiguation

The implementation of the above disambiguation process canbe viewed on two levels the macro level which identifies and col-lates the information about individual equivalence classes andtheir respectively consolidated inlinksoutlinks and the micro le-vel which repairs individual equivalence classes

On the macro level the task assumes input data sorted byboth subject (spocs0o0) and object (opsco0s0) again such thatso represent canonical identifiers and s0 o0 represent the originalidentifiers as before Note that we also require the assertedowlsameAs relations encoded likewise Given that all the re-quired information about the equivalence classes (their inlinksoutlinks derivable equivalences and original identifiers) are gath-ered under the canonical identifiers we can apply a straight-for-ward merge-join on s-o over the sorted stream of data and batchconsolidated segments of data

On a micro level we buffer each individual consolidated seg-ment into an in-memory index currently these segments fit in

rsquo

102 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

memory where for the largest equivalence classes we note thatinlinksoutlinks are commonly duplicatedmdashif this were not thecase one could consider using an on-disk index which shouldbe feasible given that only small batches of the corpus are underanalysis at each given time We assume access to the relevantterminological knowledge required for reasoning and the predi-cate-level statistics derived during from the concurrence analysisWe apply scan-reasoning and inconsistency detection over eachbatch and for efficiency skip over batches that are not sympto-mised by incompatible identifiers

For equivalence classes containing incompatible identifiers wefirst determine the full set of such pairs through application ofthe inconsistency detection rules usually each detection givesa single pair where we ignore pairs containing the same identi-fier (ie detections that would equally apply over the unconsol-idated data) We check the pairs for a trivial solution if allidentifiers in the equivalence class appear in some pair we checkwhether the graph formed by the pairs is strongly connected inwhich case the equivalence class must necessarily be completelydisbanded

For non-trivial repairs we extract the explicit owlsameAs rela-tions (which we view as directionless) and reinfer owlsameAs

relations from the consolidation rules encoding the subsequentgraph We label edges in the graph with a set of hashes denotingthe input triples required for their derivation such that the cardi-nality of the hashset corresponds to the primary edge weight Wesubsequently use a priority-queue to order the edge-weights andonly materialise concurrence scores in the case of a tie Nodes inthe equivalence graph are merged by combining unique edgesand merging the hashsets for overlapping edges Using these oper-ations we can apply the aformentioned process to derive the finalrepaired equivalence classes

In the final step we encode the repaired equivalence classesin memory and perform a final scan of the corpus (in naturalsorted order) revising identifiers according to their repairedcanonical term

73 Distributed implementation

Distribution of the task becomes straight-forward assumingthat the slave machines have knowledge of terminological datapredicate-level statistics and already have the consolidationencoding sextuples sorted and coordinated by hash on s and oNote that all of these data are present on the slave machines fromprevious tasks for the concurrence analysis we in fact maintainsextuples during the data preparation phase (although not re-quired by the analysis)

Thus we are left with two steps

Table 17Breakdown of timing of distributed disambiguation and repair

Category 1 2

Min Min

MasterTotal execution time 1147 1000 613 10Executing

Miscellaneous Idle (waiting for slaves) 1147 1000 612 10

SlaveAvg Executing (total exc idle) 1147 1000 590 9

Identify inconsistencies and repairs 747 651 385 6Repair Corpus 400 348 205 3

Avg Idle 23Waiting for peers 00 00 22Waiting for master

ndash Run each slave machine performs the above process on its seg-ment of the corpus applying a merge-join over the data sortedby (spocs0o0) and (opsco0s0) to derive batches of consoli-dated data which are subsequently analysed diagnosed anda repair derived in memory

ndash Gatherrun the master machine gathers all repair informationfrom all slave machines and floods the merged repairs to theslave machines the slave machines subsequently perform thefinal repair of the corpus

74 Performance evaluation

We ran the inconsistency detection and disambiguation overthe corpora produced by the extended consolidation approachFor the full corpus the total time taken was 235 h The inconsis-tency extraction and equivalence class repair analysis took 12 hwith a significant average idle time of 75 min (93) in particularcertain large batches of consolidated data took significant amountsof time to process particularly to reason over The additional ex-pense is due to the relaxation of duplicate detection we cannotconsider duplicates on a triple level but must consider uniquenessbased on the entire sextuple to derive the information required forrepair we must apply many duplicate inferencing steps On theother hand we can skip certain reasoning paths that cannot leadto inconsistency Repairing the corpus took 098 h with an averageidle time of 27 min

In Table 17 we again give a detailed breakdown of the timingsfor the task With regards the full corpus note that the aggregationof the repair information took a negligible amount of time andwhere only a total of one minute is spent on the slave machineMost notably load-balancing is somewhat of an issue causingslave machines to be idle for on average 72 of the total tasktime mostly waiting for peers This percentagemdashand the associatedload-balancing issuesmdashwould likely be aggrevated further by moremachines or a higher scale of data

With regards the timings of tasks when varying the number ofslave machines as the number of slave machines doubles the totalexecution times decrease by factors of 0534 0599 and0649 respectively Unlike for previous tasks where the aggrega-tion of global knowledge on the master machine poses a bottle-neck this time load-balancing for the inconsistency detectionand repair selection task sees overall performance start to convergewhen increasing machines

75 Results evaluation

It seems that our discussion of inconsistency repair has beensomewhat academic from the total of 282 million consolidated

4 8 Full-8

min Min Min

00 367 1000 238 1000 1411 1000 01 01 01 01 00 366 999 238 999 1411 1000

63 336 915 223 937 1309 92828 232 634 172 722 724 51335 103 281 51 215 585 41537 31 85 15 63 102 7237 31 84 15 62 101 72 01 01

Table 20Results of manual repair inspection

Manual Auto Count

SAME frasl 223 443DIFFERENT frasl 261 519UNCLEAR ndash 19 38frasl SAME 361 746frasl DIFFERENT 123 254SAME SAME 205 919SAME DIFFERENT 18 81DIFFERENT DIFFERENT 105 402DIFFERENT SAME 156 597

Table 18Breakdown of inconsistency detections for functional-properties where properties inthe same row gave identical detections

Functional property Detections

1 foafgender 602 foafage 323 dboheight dbolength 44 dbowheelbase dbowidth 35 atomowlbody 16 locaddress locname locphone 1

Table 19Breakdown of inconsistency detections for disjoint-classes

Disjoint Class 1 Disjoint Class 2 Detections

1 foafDocument foafPerson 1632 foafDocument foafAgent 1533 foafOrganization foafPerson 174 foafPerson foafProject 35 foafOrganization foafProject 1

1

10

100

1000

1 10 100

num

ber o

f rep

aire

d eq

uiva

lenc

e cl

asse

s

number of partitions

Fig 10 Distribution of the number of partitions the inconsistent equivalenceclasses are broken into (loglog)

1

10

100

1000

1 10 100 1000

num

ber o

f par

titio

ns

number of IDs in partition

Fig 11 Distribution of the number of identifiers per repaired partition (loglog)

44 We were particularly careful to distinguish information resourcesmdashsuch asWIKIPEDIA articlesmdashand non-information resources as being DIFFERENT

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 103

batches to check we found 280 equivalence classes (001)mdashcon-taining 3985 identifiersmdashcausing novel inconsistency Of these280 inconsistent sets 185 were detected through disjoint-classconstraints 94 were detected through distinct literal values for in-verse-functional properties and one was detected through

owldifferentFrom assertions We list the top five functional-properties given non-distinct literal values in Table 18 and thetop five disjoint classes in Table 19 note that some inconsistentequivalent classes had multiple detections where eg the classfoafPerson is a subclass of foafAgent and thus an identicaldetection is given for each Notably almost all of the detections in-volve FOAF classes or properties (Further we note that betweenthe time of the crawl and the time of writing the FOAF vocabularyhas removed disjointness constraints between the foafDocu-

ment and foafPersonfoafAgent classes)In terms of repairing these 280 equivalence classes 29 (104)

had to be completely disbanded since all identifiers were pairwiseincompatible A total of 905 partitions were created during the re-pair with each of the 280 classes being broken up into an averageof 323 repaired consistent partitions Fig 10 gives the distributionof the number of repaired partitions created for each equivalenceclass 230 classes were broken into the minimum of two partitionsand one class was broken into 96 partitions Fig 11 gives the dis-tribution of the number of identfiers in each repaired partitioneach partition contained an average of 44 equivalent identifierswith 577 identifiers in singleton partitions and one partition con-taining 182 identifiers

Finally although the repaired partitions no longer cause incon-sistency this does not necessarily mean that they are correct Fromthe raw equivalence classes identified to cause inconsistency weapplied the same sampling technique as before we extracted503 identifiers and for each randomly sampled an identifier orig-inally thought to be equivalent We then manually checkedwhether or not we considered the identifiers to refer to the sameentity or not In the (blind) manual evaluation we selected optionUNCLEAR 19 times option SAME 232 times (of which 49 were TRIVIALLY

SAME) and option DIFFERENT 249 times44 We checked our manuallyannotated results against the partitions produced by our repair tosee if they corresponded The results are enumerated in Table 20Of the 481 (clear) manually-inspected pairs 361 (721) remainedequivalent after the repair and 123 (254) were separated Of the223 pairs manually annotated as SAME 205 (919) also remainedequivalent in the repair whereas 18 (181) were deemed differenthere we see that the repair has a 919 recall when reesatablishingequivalence for resources that are the same Of the 261 pairs verifiedas DIFFERENT 105 (402) were also deemed different in the repairwhereas 156 (597) were deemed to be equivalent

Overall for identifying which resources were the same andshould be re-aligned during the repair the precision was 0568with a recall of 0919 and F1-measure of 0702 Conversely theprecision for identifying which resources were different and shouldbe kept separate during the repair the precision was 0854 with arecall of 0402 and F1-measure of 0546

Interpreting and inspecting the results we found that the

104 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

inconsistency-based repair process works well for correctly fixingequivalence classes with one or two lsquolsquobad applesrsquorsquo However forequivalence classes which contain many broken resources wefound that the repair process tends towards re-aligning too manyidentifiers although the output partitions are consistent theyare not correct For example we found that 30 of the pairs whichwere annotated as different but were re-consolidated by the repairwere from the operacom domain where users had specified var-ious nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by ourblack-list) This lead to groups of users (the largest containing205 users) being initially identified as being co-referent througha web of sharing different such values In these groups inconsis-tency was caused by different values for gender (a functional prop-erty) and so they were passed into the repair process Howeverthe repair process simply split the groups into male and femalepartitions which although now consistent were still incorrectThis illustrates a core weakness of the proposed repair processconsistency does not imply correctness

In summary for an inconsistency-based detection and repairprocess to work well over Linked Data vocabulary publisherswould need to provide more sensible axioms which indicate wheninstance data are inconsistent These can then be used to detectmore examples of incorrect equivalence (amongst other use-cases[9]) To help repair such cases in particular the widespread useand adoption of datatype-properties which are both inverse-func-tional and functional (ie one-to-one properties such as isbn)would greatly help to automatically identify not only which re-sources are the same but to identify which resources are differentin a granular manner45 Functional properties (eg gender) or dis-joint classes (eg Person Document) which have wide catchmentsare useful to detect inconsistencies in the first place but are not idealfor performing granular repairs

8 Critical discussion

In this section we provide critical discussion of our approachfollowing the dimensions of the requirements listed at the outset

With respect to scale on a high level our primary means oforganising the bulk of the corpus is external-sorts characterisedby the linearithmic time complexity O(nfrasllog(n)) external-sortsdo not have a critical main-memory requirement and are effi-ciently distributable Our primary means of accessing the data isvia linear scans With respect to the individual tasks

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines is small in volume—for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].
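To make the external-sort primitive concrete, the following is a minimal, generic sketch (our illustration, not the system's actual implementation; the record format, file paths and batch size are hypothetical): records are sorted in bounded-memory batches that are written as sorted runs and then merged in a single streaming pass.

import heapq, os, tempfile

def external_sort(in_path, out_path, batch_size=1_000_000):
    # Sort a file of newline-delimited records using bounded memory.
    runs = []
    with open(in_path) as f:
        while True:
            batch = [line for _, line in zip(range(batch_size), f)]
            if not batch:
                break
            batch.sort()                         # in-memory sort of one batch
            run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            run.writelines(batch)
            run.close()
            runs.append(run.name)
    with open(out_path, "w") as out:             # n-way merge of the sorted runs
        files = [open(r) for r in runs]
        out.writelines(heapq.merge(*files))
        for f in files:
            f.close()
    for r in runs:
        os.remove(r)

Subsequent operations (e.g., aggregating all statements with a common subject) then reduce to linear scans over the sorted output.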

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck—for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples (see the sketch after this list); although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.
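The hash-based placement mentioned above can be sketched as follows (our illustration only; the choice of hash function and constants is arbitrary): each quad is sent to the machine determined by hashing its subject, and additionally to the machine determined by hashing its object unless the predicate is rdf:type.

import zlib

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def machine_for(term, num_machines):
    # Deterministic placement: hash the term and take it modulo the machine count.
    return zlib.crc32(term.encode("utf-8")) % num_machines

def place_quad(s, p, o, c, num_machines):
    # One copy keyed on the subject; one keyed on the object for non-rdf:type triples.
    targets = {machine_for(s, num_machines)}
    if p != RDF_TYPE:
        targets.add(machine_for(o, num_machines))
    return targets

Highly skewed terms—notably the objects of rdf:type triples, which include very frequent class terms—are precisely those excluded from object-based hashing, which is what keeps the distribution even.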

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks—particularly the scale and diversity of the datasets—we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criteria for determining what precisely signifies what—the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such consideration, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third party data about an entity is somewhat fallacious without also considering the third party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas—similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems, where the data model does not provide primitives for making same-as assertions […], there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty; [thus] it is often necessary to hinge on the semantics of the data".

Some other works—more related to Information Retrieval and Natural Language Processing—focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets—i.e., consistently named datasets—

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology—particularly functional properties, inverse-functional properties and disjointness constraints—which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets, the interlink process materialises owl:sameAs relations between the two datasets, the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules), and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis: they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL—we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges—esp. scalability and noise—of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system—called ObjectCoRef—for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data, and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris


to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties—i.e., those for which a typical value is shared by few instances—are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space, where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
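As a generic illustration of this idea (a MinHash-style sketch of our own, not the RDFSim index itself; the tokenisation and the band/row parameters are hypothetical), resources whose signatures collide in any band become candidates for a full comparison:

import hashlib
from itertools import combinations

def minhash_signature(tokens, num_hashes=64):
    # One value per hash function: the minimum hash observed over the tokens.
    return [min(int(hashlib.md5(f"{i}|{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def candidate_pairs(docs, bands=16, rows=4):
    # docs: {resource: set of tokens from its virtual document}.
    buckets = {}
    for res, tokens in docs.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(res)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

Only the pairs returned here would then be compared in full (e.g., by the Jaccard coefficient over their token sets), avoiding a complete pair-wise comparison.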

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes—e.g., to notify another agent or correct broken links—and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs—particularly substitution—does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static, Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We


have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper—particularly as applied to large-scale Linked Data corpora—is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).
[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the Same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.
[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.



thousand coreferent pairs. In particular, this section serves as a baseline for comparison against the extended consolidation approach that uses richer reasoning features, explored later in Section 5.

4.1. High-level approach

Based on the previous discussion, the approach is straightforward:

(i) scan the corpus and separate out all asserted owl:sameAs relations from the main body of the corpus;

(ii) load these relations into an in-memory index that encodes the transitive and symmetric semantics of owl:sameAs;

(iii) for each equivalence class in the index, choose a canonical term;

(iv) scan the corpus again, canonicalising any term in the subject position or object position of a non-rdf:type triple.

Thus, we need only index a small subset of the corpus—owl:sameAs statements—and can apply consolidation by means of two scans.

The non-trivial aspects of the algorithm are given by the equality closure and index. To perform the in-memory transitive, symmetric closure, we use a traditional union–find algorithm [66,40] for computing equivalence partitions, where (i) equivalent elements are stored in common sets, such that each element is only contained in one set; (ii) a map provides a find function for looking up which set an element belongs to; and (iii) when new equivalent elements are found, the sets they belong to are unioned. The process is based on an in-memory map index and is detailed in Algorithm 1, where:

(i) the eqc labels refer to equivalence classes, and s and o refer to RDF subject and object terms;

(ii) the function map.get refers to an identifier lookup on the in-memory index that should return the intermediary equivalence class associated with that identifier (i.e., find);

(iii) the function map.put associates an identifier with a new equivalence class.

The output of the algorithm is an in-memory map from identifiers to their respective equivalence class.

Next, we must choose a canonical term for each equivalence class: we prefer URIs over blank-nodes, thereafter choosing a term with the lowest alphabetical ordering; a canonical term is thus associated to each equivalent set. Once the equivalence index has been finalised, we re-scan the corpus and canonicalise the data, using the in-memory index to service lookups.
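The canonical-term choice and the second (rewriting) scan can be sketched as follows (an illustrative Python rendering of our own, not the system's actual code; the prefixed rdf:type form and blank-node test are assumptions):

def pick_canonical(equivalence_class):
    # Prefer URIs over blank nodes, then take the lexicographically lowest term.
    is_blank = lambda t: t.startswith("_:")
    return min(equivalence_class, key=lambda t: (is_blank(t), t))

def canonicalise(quads, eq_map):
    # Second scan: rewrite subjects, and objects of non-rdf:type triples,
    # to the canonical term of their equivalence class (if any).
    rdf_type = "rdf:type"
    for s, p, o, c in quads:
        if s in eq_map:
            s = pick_canonical(eq_map[s])
        if p != rdf_type and o in eq_map:
            o = pick_canonical(eq_map[o])
        yield s, p, o, c

In practice, the canonical term would be chosen once per equivalence class rather than recomputed per lookup, but the selection criterion is the same.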

4.2. Distributed approach

Again, distribution of the approach is fairly intuitive, as follows:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract owl:sameAs relations;

(ii) Gather: gather all owl:sameAs relations onto the master machine and build the in-memory equality index;

(iii) Flood/run: send the equality index (in its entirety) to each slave machine and apply the consolidation scan in parallel.

As we will see in the next section, the most expensive methods—involving the two scans of the main corpus—can be conducted in parallel.

Algorithm 1

Building equivalence map [32]

Require: SAMEAS DATA: SA
map ← {}
for (s, owl:sameAs, o) ∈ SA ∧ s ≠ o do
  eqcs ← map.get(s)
  eqco ← map.get(o)
  if eqcs = ∅ ∧ eqco = ∅ then
    eqcs∪o ← {s, o}
    map.put(s, eqcs∪o)
    map.put(o, eqcs∪o)
  else if eqcs = ∅ then
    add s to eqco
    map.put(s, eqco)
  else if eqco = ∅ then
    add o to eqcs
    map.put(o, eqcs)
  else if eqcs ≠ eqco then
    if |eqcs| > |eqco| then
      add all eqco into eqcs
      for eo ∈ eqco do
        map.put(eo, eqcs)
      end for
    else
      add all eqcs into eqco
      for es ∈ eqcs do
        map.put(es, eqco)
      end for
    end if
  end if
end for
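For readers who prefer an executable form, the following is a compact Python rendering of Algorithm 1 (our sketch; it keeps all equivalence classes in memory and omits the distributed and on-disk variants discussed elsewhere):

def build_equivalence_map(same_as_pairs):
    # same_as_pairs: iterable of (s, o) with s owl:sameAs o.
    # Returns a map from each identifier to its (shared) equivalence class.
    eq_map = {}
    for s, o in same_as_pairs:
        if s == o:
            continue
        eqc_s, eqc_o = eq_map.get(s), eq_map.get(o)
        if eqc_s is None and eqc_o is None:      # neither seen before: new class
            eqc = {s, o}
            eq_map[s] = eq_map[o] = eqc
        elif eqc_s is None:                      # s joins o's class
            eqc_o.add(s)
            eq_map[s] = eqc_o
        elif eqc_o is None:                      # o joins s's class
            eqc_s.add(o)
            eq_map[o] = eqc_s
        elif eqc_s is not eqc_o:                 # union: merge smaller into larger
            if len(eqc_s) < len(eqc_o):
                eqc_s, eqc_o = eqc_o, eqc_s
            eqc_s.update(eqc_o)
            for e in eqc_o:
                eq_map[e] = eqc_s
    return eq_map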

4.3. Performance evaluation

We applied the distributed base-line consolidation over our corpus with the aforementioned procedure and setup. The entire consolidation process took 63.3 min, with the bulk of time taken as follows: the first scan extracting owl:sameAs statements took 12.5 min, with an average idle time for the servers of 11 s (1.4%)—i.e., on average, the slave machines spent 1.4% of the time idly waiting for peers to finish. Transferring, aggregating and loading the owl:sameAs statements on the master machine took 8.4 min. The second scan, rewriting the data according to the canonical identifiers, took in total 42.3 min, with an average idle time of 64.7 s (2.5%) for each machine at the end of the round. The slower time for the second round is attributable to the extra overhead of re-writing the GZip-compressed data to disk, as opposed to just reading.

In the rightmost column of Table 5 (full-8), we give a breakdown of the timing for the tasks over the full corpus, using eight slave machines and one master machine. Independent of the number of slaves, we note that the master machine required 8.5 min for coordinating globally-required owl:sameAs knowledge, and that the rest of the task time is spent in embarrassingly parallel execution (amenable to reduction by increasing the number of machines). For our setup, the slave machines were kept busy for, on average, 84.6% of the total task time; of the idle time, 87% was spent waiting for the master to coordinate the owl:sameAs data, and 13% was spent waiting for peers to finish their task, due to sub-optimal load balancing. The master machine spent 86.6% of the task idle, waiting for the slaves to finish.

In addition, Table 5 also gives performance results for one master and varying numbers of slave machines (1, 2, 4, 8) over the

Table 5
Breakdown of timing of distributed baseline consolidation.

Category                    | 1 (min / %)  | 2 (min / %)  | 4 (min / %)  | 8 (min / %)  | Full-8 (min / %)

Master
Total execution time        | 42.2 / 100.0 | 22.3 / 100.0 | 11.9 / 100.0 | 6.7 / 100.0  | 63.3 / 100.0
Executing                   | 0.5 / 1.1    | 0.5 / 2.2    | 0.5 / 4.0    | 0.5 / 7.5    | 8.5 / 13.4
  Aggregating owl:sameAs    | 0.4 / 1.1    | 0.5 / 2.2    | 0.5 / 4.0    | 0.5 / 7.5    | 8.4 / 13.3
  Miscellaneous             | 0.1 / 0.1    | ~ / 0.1      | ~ / ~        | ~ / 0.4      | 0.1 / 0.1
Idle (waiting for slaves)   | 41.8 / 98.9  | 21.8 / 97.7  | 11.4 / 95.8  | 6.2 / 92.1   | 54.8 / 86.6

Slaves
Avg. executing (total)      | 41.8 / 98.9  | 21.8 / 97.7  | 11.4 / 95.8  | 6.2 / 92.1   | 53.5 / 84.6
  Extract owl:sameAs        | 9.0 / 21.4   | 4.5 / 20.2   | 2.2 / 18.4   | 1.1 / 16.8   | 12.3 / 19.5
  Consolidate               | 32.7 / 77.5  | 17.1 / 76.9  | 8.8 / 73.5   | 4.7 / 70.5   | 41.2 / 65.1
Avg. idle                   | 0.5 / 1.1    | 0.7 / 2.9    | 1.0 / 8.2    | 0.8 / 12.7   | 9.8 / 15.4
  Waiting for peers         | 0.0 / 0.0    | 0.1 / 0.7    | 0.5 / 4.0    | 0.3 / 4.7    | 1.3 / 2.0
  Waiting for master        | 0.5 / 1.1    | 0.5 / 2.3    | 0.5 / 4.2    | 0.5 / 7.9    | 8.5 / 13.4

Fig. 3. Distribution of sizes of equivalence classes, on log/log scale (from [32]). [Plot omitted; y-axis: number of classes, x-axis: equivalence class size.]

Fig. 4. Distribution of the number of PLDs per equivalence class, on log/log scale. [Plot omitted; y-axis: number of classes, x-axis: number of PLDs in equivalence class.]

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-2877d02973d0d01e8f29c7113776e7e, which, at the time of writing, correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


100 million quadruple corpus. Note that in the table, we use the symbol ~ to indicate a negligible value (0 < ~ < 0.05). We see that as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines are doubled: we observe a 0.528×, 0.536× and 0.561× reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5× would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs)—an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for class 1 and 2 were deemed incorrect due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors: resp., Dieter Fensel and Rudi Studer—authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBpedia and LastFM. With respect to the DBpedia URIs, these (i) were equivalent but for capitalisation variations or stop-words; (ii) were variations of abbreviations or valid synonyms; (iii) were different language versions (e.g., dbpedia:Etats_Unis); (iv) were nicknames (e.g., dbpedia:Yankee_land); (v) were related but not equivalent (e.g., dbpedia:American_Civilization); (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes—which, we have seen, are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure—we also manually evaluated a random sample of results. For the sampling, we take the closed


Table 6
Largest five equivalence classes (from [32]).

#  Canonical term (lexically first in equivalence class)                              Size   Correct

1  http://bio2rdf.org/dailymed_drugs:1000                                             8,481
2  http://bio2rdf.org/dailymed_drugs:1042                                             800
3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e        443    ✓
4  http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1              353    ✓
5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9   316    ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are in fact equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision, or for difficult, subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we choose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on in Table 7 we give the most frequently co-occurringPLD-pairs in our equivalence classes where datasets resident onthese domains are lsquolsquoheavilyrsquorsquo interlinked with owl sameAs rela-tions We italicise the indexes of inter-domain links that were ob-served to often be purely syntactic

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD-pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

     PLD             PLD                     Co-occur
  1  rdfize.com      uriburner.com           506623
  2  dbpedia.org     freebase.com            187054
  3  bio2rdf.org     purl.org                185392
  4  loc.gov         info:lc/authorities(a)  166650
  5  l3s.de          rkbexplorer.com         125842
  6  bibsonomy.org   l3s.de                  125832
  7  bibsonomy.org   rkbexplorer.com         125830
  8  dbpedia.org     mpii.de                 99827
  9  freebase.com    mpii.de                 89610
  10 dbpedia.org     umbel.org               65565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities; in particular, we use OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features—including functional properties, inverse-functional properties and certain cardinality restrictions—that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes—such as ISBNs for books, EAN·UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc.—to bootstrap identity on the Web of Data within certain domains: such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email addresses, personal homepages, WIKIPEDIA articles, etc.

Although the Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to the methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive/symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked Data corpora.

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1000 pairs found to be coreferent.

5.1. High-level approach

In Table 2 we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with fewer than five occurrences in the data—hence, we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals—we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]: unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply the general rules over the corpus, and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

(v) apply consolidation over the main corpus, with respect to the closed owl:sameAs data.

In Step (i), we extract terminological data—required for the application of our rules—from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf

we would generate a T-ground rule:

?x homepage ?y ⇒ ?x isPrimaryTopicOf ?y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be applied by means of a single scan over the data, since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns, such as

?x isPrimaryTopicOf ?y

which may indicate data useful for consolidation (note that, here, isPrimaryTopicOf is assumed to be inverse-functional).
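To make Steps (ii) and (iii) more concrete, the following minimal Python sketch—ours, not the authors' implementation; the prefixes, triple encoding and the assumption that isPrimaryTopicOf is inverse-functional are purely illustrative—compiles a subPropertyOf axiom into a T-ground rule with a single assertional pattern, applies all such rules in one pass, and buffers consolidation-relevant statements.

```python
# Minimal sketch (not the authors' code): T-ground a subPropertyOf axiom into a
# rule with one assertional pattern, then apply all such rules in a single scan,
# buffering statements that match consolidation-relevant patterns.

TERMINOLOGY = [
    # illustrative terminological statement
    ("foaf:homepage", "rdfs:subPropertyOf", "foaf:isPrimaryTopicOf"),
]

# T-ground rules: body predicate -> head predicate
t_ground_rules = {s: o for (s, p, o) in TERMINOLOGY if p == "rdfs:subPropertyOf"}

# Patterns useful for consolidation (here we assume foaf:isPrimaryTopicOf is
# declared inverse-functional in the authoritative terminology).
consolidation_patterns = {"foaf:isPrimaryTopicOf"}

def single_scan(triples):
    """One pass over the assertional data: fire T-ground rules and buffer any
    input or inferred statement matching a consolidation-relevant pattern."""
    buffered = []
    for s, p, o in triples:
        candidates = [(s, p, o)]
        if p in t_ground_rules:
            candidates.append((s, t_ground_rules[p], o))
        buffered.extend(t for t in candidates if t[1] in consolidation_patterns)
    return buffered

data = [("ex:aidan", "foaf:homepage", "http://aidanhogan.com/")]
print(single_scan(data))
# -> [('ex:aidan', 'foaf:isPrimaryTopicOf', 'http://aidanhogan.com/')]
```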

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern—including any owl:sameAs statements found—are buffered to a file for later join computation.

Subsequently, in Step (iv), we must compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].
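As a rough illustration of primitives (ii) and (iii)—a sketch under the assumption that in-memory lists stand in for the on-disk flat files—an external-style sort over fixed-size batches and a merge-join of two streams sorted on the same key can be written as follows:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def external_sort(tuples, batch_size=100000):
    """Sort fixed-size batches, then merge the sorted batches (cf. external sort)."""
    batches = [sorted(tuples[i:i + batch_size])
               for i in range(0, len(tuples), batch_size)]
    return heapq.merge(*batches)

def merge_join(left_sorted, right_sorted, key=itemgetter(0)):
    """Interleaved scan of two streams sorted on the same key; an in-memory
    join is applied per individual join element."""
    left_groups, right_groups = groupby(left_sorted, key), groupby(right_sorted, key)
    l, r = next(left_groups, None), next(right_groups, None)
    while l and r:
        if l[0] < r[0]:
            l = next(left_groups, None)
        elif l[0] > r[0]:
            r = next(right_groups, None)
        else:
            lg, rg = list(l[1]), list(r[1])
            for a in lg:
                for b in rg:
                    yield a, b
            l, r = next(left_groups, None), next(right_groups, None)

left = [("a", 1), ("b", 2)]
right = [("b", "x"), ("c", "y")]
print(list(merge_join(left, right)))   # [(('b', 2), ('b', 'x'))]
```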

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19

18 These files are G-Zip-compressed flat files of N-Triple-like syntax, encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.


Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.
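For instance, a hypothetical sketch of the inverse-functional-property case: once the relevant memberships are sorted by object (and predicate), a single linear scan groups subjects sharing a (predicate, object) pair and emits the corresponding owl:sameAs conclusions. The data and prefixes below are illustrative only.

```python
from itertools import groupby

# Sketch (assumed input format): memberships of inverse-functional properties,
# pre-sorted by (object, predicate) so that each join group is contiguous.
ifp_statements = sorted([
    ("ex:axel", "foaf:isPrimaryTopicOf", "http://polleres.net"),
    ("dblp:Axel", "foaf:isPrimaryTopicOf", "http://polleres.net"),
    ("ex:aidan", "foaf:isPrimaryTopicOf", "http://aidanhogan.com/"),
], key=lambda t: (t[2], t[1]))

def infer_same_as(sorted_ifp):
    """Linear scan: for each (object, predicate) group, emit owl:sameAs pairs."""
    for (_o, _p), group in groupby(sorted_ifp, key=lambda t: (t[2], t[1])):
        subjects = sorted({s for (s, _, _) in group})
        canonical = subjects[0]                # lexically lowest as pivot
        for other in subjects[1:]:
            yield (canonical, "owl:sameAs", other)

print(list(infer_same_as(ifp_statements)))
# -> [('dblp:Axel', 'owl:sameAs', 'ex:axel')]
```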

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method again, leveraging external sorts and merge-joins to perform the computation (herein we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and we define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:

20 One could instead build an on-disk map for equivalence classes and pivot elements, and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);

– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;

– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;

– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;

(iv) the process of reaching a fixpoint is accelerated using available main memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
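The batch algorithm itself is hard to compress into a few lines; as a purely in-memory reference for what the closure produces—not the on-disk method of this section—the following sketch maps every identifier to the lexically lowest member of its equivalence class using a union–find structure:

```python
def canonical_closure(same_as_pairs):
    """In-memory analogue of the canonicalised symmetric/transitive closure:
    map every term to the lexically lowest member of its equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # keep the lexically lowest root as the (intermediate) canonical term
            lo, hi = sorted((ra, rb))
            parent[hi] = lo

    for a, b in same_as_pairs:
        union(a, b)

    # non-symmetric canonicalised relations: (canonical, non-canonical)
    return sorted({(find(x), x) for x in parent if find(x) != x})

pairs = [("c", "a"), ("c", "d"), ("d", "e")]
print(canonical_closure(pairs))
# -> [('a', 'c'), ('a', 'd'), ('a', 'e')]
```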

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite, and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20

Finally, note that in the derivation of owl:sameAs conclusions from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net>

axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie>

<http://polleres.net> owl:sameAs <http://axel.deri.ie>

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs closure, we require a higher-level iterative process, as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;

(ii) apply the initial closure of the owl:sameAs data;

(iii) then, iteratively until no new owl:sameAs inferences are found:

– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;

– derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;

– re-derive the closure of the owl:sameAs data, including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus, in two sorts and merge-join scans over subject and object.
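The shape of this higher-level iteration can be summarised by a small driver loop; the function arguments below are hypothetical stand-ins for the rule-application, closure and rewriting steps described above, and owl:sameAs relations are modelled as a set of pairs.

```python
def derive_same_as(asserted_same_as, rule_data, apply_consolidation_rules,
                   close_same_as, rewrite_join_positions):
    """Driver loop (sketch): alternate consolidation-rule application,
    owl:sameAs closure and join-position rewriting until a fixpoint."""
    closed = close_same_as(asserted_same_as | apply_consolidation_rules(rule_data))
    while True:
        rule_data = rewrite_join_positions(rule_data, closed)
        new_pairs = apply_consolidation_rules(rule_data) - closed
        if not new_pairs:
            return closed                      # fixpoint: nothing new inferred
        closed = close_same_as(closed | new_pairs)
```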

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;

(ii) Gather: gather the terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;

(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;

(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;

(v) Gather: gather all consolidation-relevant statements onto the master machine; then, in parallel:


Table 8
Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently.

  Category                                      1             2             4             8             Full-8
                                                Min    %      Min    %      Min    %      Min    %      Min    %
  Master
    Total execution time                        454.3  100.0  230.2  100.0  127.5  100.0  81.3   100.0  740.4  100.0
    Executing                                   29.9   6.6    30.3   13.1   32.5   25.5   30.1   37.0   243.6  32.9
      Aggregate Consolidation-Relevant Data     4.3    0.9    4.5    2.0    4.7    3.7    4.6    5.7    29.9   4.0
      Closing owl:sameAs*                       24.4   5.4    24.5   10.6   25.4   19.9   24.4   30.0   211.2  28.5
      Miscellaneous                             1.2    0.3    1.3    0.5    2.4    1.9    1.1    1.3    2.5    0.3
    Idle (waiting for slaves)                   424.5  93.4   199.9  86.9   95.0   74.5   51.2   63.0   496.8  67.1
  Slave
    Avg. Executing (total)                      448.9  98.8   221.2  96.1   111.6  87.5   56.5   69.4   599.1  80.9
      Extract Terminology                       44.4   9.8    22.7   9.9    11.8   9.2    6.9    8.4    49.4   6.7
      Extract Consolidation-Relevant Data       95.5   21.0   48.3   21.0   24.3   19.1   13.2   16.2   137.9  18.6
      Initial Sort (by subject)*                98.0   21.6   48.3   21.0   23.7   18.6   11.6   14.2   133.8  18.1
      Consolidation                             211.0  46.4   101.9  44.3   51.9   40.7   24.9   30.6   278.0  37.5
    Avg. Idle                                   5.5    1.2    8.9    3.9    37.9   29.7   23.6   29.0   141.3  19.1
      Waiting for peers                         0.0    0.0    3.2    1.4    7.1    5.6    6.3    7.7    37.5   5.1
      Waiting for master                        5.5    1.2    5.8    2.5    30.8   24.1   17.3   21.2   103.8  14.0


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;

– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application-specific.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest, due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible—the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time—of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on the time possible for the task to execute with respect to increasing machines in our setup—in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment, we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of the owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regard to the performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time ratios as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements—this time directly extracted from explicit statements, as opposed to explicit and inferred statements—which took 15.4 min (11% of the time taken when including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine, and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9
Top ten most frequently occurring blacklisted values.

     Blacklisted term                               Occurrences
  1  Empty literals                                 584735
  2  <http://null>                                  414088
  3  <http://www.vox.com/gone/>                     150402
  4  "08445a31a78661b5c746feff39a9db6e4e2cc5cf"     58338
  5  <http://www.facebook.com>                      6988
  6  <http://facebook.com>                          5462
  7  <http://www.google.com>                        2234
  8  <http://www.facebook.com/>                     1254
  9  <http://google.com>                            1108
  10 <http://null.com>                              542

Fig. 5. Distribution of the number of identifiers per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log).


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus, we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted—however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9—this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2, and s2 is a canonical identifier) contained 12.03 million statements.

22 cf. http://bio2rdf.org/ns/bio2rdf:Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline—5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× from baseline—5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline—6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user—labelled Team Vox—commonly appearing in user FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford—labelled IBRG—where again it is identified in thousands of documents using different blank-nodes but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME 823 times (82.3%), TRIVIALLY SAME 145 times (14.5%), DIFFERENT 23 times (2.3%) and UNCLEAR 9 times (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

    Canonical term (lexically lowest in equivalence class)          Size    Correct?
  1 bnode37@http://a12iggymom.vox.com/profile/foaf.rdf              33052   ✓
  2 http://bio2rdf.org/dailymed_drugs:1000                          8481    ✗
  3 http://ajft.org/rdf/foaf.rdf#me                                  8140    ✓
  4 bnode4@http://174.129.12.140:8080/tcm/data/association/100      4395    ✓
  5 bnode1@http://aaa977.vox.com/profile/foaf.rdf                   1977    ✓

Fig. 6. Distribution of the number of PLDs per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log).

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

    Canonical identifier                              BL    R
  1 <http://dblp.l3s.de/Tim_Berners-Lee>              26    50
  2 <genid:danbri>                                    10    138
  3 <http://update.status.net/>                       0     0
  4 <http://www.ldodds.com/foaf/foaf-a-matic>         0     0
  5 <http://update.status.net/user/1#acct>            0     6

Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated, and reasoning consolidated data (log/log).


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers

25 We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley"—one of the founders and leading contributors of the FOAF project—with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs—one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity—in terms of unique PLDs—was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities—with higher concurrence scores—are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (5.6 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results: we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values (instead of URIs) to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities—if a suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs—pairs that are typically shared by few entities; i.e., for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities—a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs—e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers)—where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice

dblp:AliceB10 foaf:maker ex:Bob

ex:Alice foaf:gender "female"
ex:Alice foaf:workplaceHomepage <http://wonderland.com/>

ex:Bob foaf:gender "male"
ex:Bob foaf:workplaceHomepage <http://wonderland.com/>

ex:Claire foaf:gender "female"
ex:Claire foaf:workplaceHomepage <http://wonderland.com/>

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence
First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality, as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth, in this section, simply cardinality) of p with respect to s in G, denoted CardG(p,s), is the cardinality of the set $\{o \in C \mid (s,p,o) \in G\}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth, in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICardG(p,o), is the cardinality of the set $\{s \in U \cup B \mid (s,p,o) \in G\}$.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting Gex denote our example graph, then, e.g., CardGex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph—e.g., ICardGex(foaf:gender, "female") = 2. Both directions are considered analogously for deriving concurrence scores—note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality)—we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AICG(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathit{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality ACG(p) of a property p is defined analogously as for AICG(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is, by definition, greater-than or equal-to the denominator. Taking an example, AICGex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AICG(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and, analogously, properties for which we observe ACG(p) ≈ 1 as quasi-functional properties. We see the values of such properties—in their respective directions—as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AICG(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AICG(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean—consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula to weight properties with few observations towards a global mean, as follows:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAICG(p), is then

$$\mathit{AAIC}_G(p) = \frac{\mathit{AIC}_G(p)\cdot|G_p| \;+\; \mathit{AIC}_G\cdot\overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1)$$

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\mathit{AIC}_G$ is the average inverse-cardinality for all predicate–object pairs (formally, $\mathit{AIC}_G = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s\,\exists o : (s,p,o) \in G\}|}$).
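As a small sketch of Definitions 3 and 4—ours, for illustration; it assumes a duplicate-free, in-memory list of triples—the per-predicate AIC and its credibility-weighted AAIC of Eq. (1) can be computed as follows:

```python
from collections import defaultdict

def aaic(triples):
    """Compute AIC_G(p) and the credibility-weighted AAIC_G(p) per predicate,
    assuming `triples` is a duplicate-free list of (s, p, o) tuples."""
    subjects_per_po = defaultdict(set)            # (p, o) -> {s}
    for s, p, o in triples:
        subjects_per_po[(p, o)].add(s)

    objects_per_p = defaultdict(set)              # p -> {o}
    triples_per_p = defaultdict(int)              # p -> |{(s,o) : (s,p,o) in G}|
    for (p, o), subs in subjects_per_po.items():
        objects_per_p[p].add(o)
        triples_per_p[p] += len(subs)

    total_pairs = len(subjects_per_po)            # |{(p,o) | ∃s (s,p,o) ∈ G}|
    mean_aic = len(triples) / total_pairs         # global mean AIC_G
    mean_size = total_pairs / len(objects_per_p)  # average |G_p| over predicates

    result = {}
    for p, objs in objects_per_p.items():
        g_p = len(objs)
        aic = triples_per_p[p] / g_p
        result[p] = (aic, (aic * g_p + mean_aic * mean_size) / (g_p + mean_size))
    return result

G = [("ex:Alice", "foaf:gender", "female"), ("ex:Claire", "foaf:gender", "female"),
     ("ex:Bob", "foaf:gender", "male"),
     ("ex:Alice", "foaf:workplaceHomepage", "http://wonderland.com/"),
     ("ex:Bob", "foaf:workplaceHomepage", "http://wonderland.com/"),
     ("ex:Claire", "foaf:workplaceHomepage", "http://wonderland.com/")]
for p, (aic_p, aaic_p) in aaic(G).items():
    print(p, round(aic_p, 2), round(aaic_p, 2))
```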

Again, the adjusted average cardinality AACG(p) of a property p is defined analogously as for AAICG(p). Some reduction of Eq. (1) is possible if one considers that $\mathit{AIC}_G(p)\cdot|G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\mathit{AIC}_G\cdot\overline{|G|} = \frac{|G|}{|\{p \mid \exists s\,\exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate.

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).


We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AICG(p)) is dampened towards a mean (AICG) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($\overline{|G|}$).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company—e.g., http://google.com/—should not gain as much concurrence as entities that share a value referential to a smaller company—e.g., http://deri.ie/. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as

$$C_G(p,s) = \frac{1}{\mathit{Card}_G(p,s)\cdot \mathit{AAC}_G(p)}$$

and the concurrence-coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as

$$\mathit{IC}_G(p,o) = \frac{1}{\mathit{ICard}_G(p,o)\cdot \mathit{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0,1], since the denominator is, by definition, necessarily greater than or equal to one.
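Continuing the sketch, the coefficients of Definition 5 follow directly from the observed (inverse-)cardinalities and the adjusted averages; aaic_of_p and aac_of_p below are assumed to come from a routine such as the AAIC sketch above (and its subject-direction analogue).

```python
def ic_coefficient(triples, p, o, aaic_of_p):
    """IC_G(p, o) = 1 / (ICard_G(p, o) * AAIC_G(p))."""
    icard = len({s for (s, p2, o2) in triples if p2 == p and o2 == o})
    return 1.0 / (icard * aaic_of_p)

def c_coefficient(triples, p, s, aac_of_p):
    """C_G(p, s) = 1 / (Card_G(p, s) * AAC_G(p))."""
    card = len({o for (s2, p2, o) in triples if p2 == p and s2 == s})
    return 1.0 / (card * aac_of_p)
```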

To take an example, let pwh = foaf:workplaceHomepage, and say that we compute AAICG(pwh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to, on average, by seven employees. Further, let og = http://google.com/ and assume that og occurs 2000 times as a value for pwh: ICardG(pwh, og) = 2000; now, ICG(pwh, og) = 1/(2000 × 7) ≈ 0.00007. Also, let od = http://deri.ie/ such that ICardG(pwh, od) = 100; now, ICG(pwh, od) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what's signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, so as to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z1, ..., zn) be a tuple such that, for each i = 1, ..., n, zi ∈ ]0,1]. The aggregated concurrence value ACSn is computed iteratively: starting with ACS0 = 0, then for each k = 1, ..., n, ACSk = zk + ACSk−1 − zk·ACSk−1.

The computation of the ACS value is the same process as determining the probability of either of two independent events occurring—P(A ∨ B) = P(A) + P(B) − P(A)·P(B)—which is by definition commutative and associative, and thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
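Since the combination is the familiar "probabilistic or", a one-line fold suffices as a sketch of Definition 6 (the example coefficient values are illustrative only):

```python
from functools import reduce

def acs(coefficients):
    """Aggregated concurrence score: fold z ⊕ acc = z + acc - z*acc over the tuple."""
    return reduce(lambda acc, z: z + acc - z * acc, coefficients, 0.0)

print(acs([0.00143, 0.00143]))   # two weakly exclusive shared pairs
print(acs([0.5, 0.00143]))       # one strongly exclusive pair dominates
```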

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate pi.29 We then aggregate this bag as follows:

$$\mathit{ACS}(\mathcal{Z}) = \mathit{ACS}\Big(\big(\mathit{ACS}(Z_{p_i}\cdot \mathit{AA[I]C}(p_i))\big)_{Z_{p_i}\in\mathcal{Z}}\Big)$$

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper-bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis
We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory—if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
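A sketch of this statistics pass—assuming the quads are already sorted by (o, p, s, c), so that each (p, o) group is contiguous—maintains the running averages in a single streaming scan:

```python
from itertools import groupby
from collections import defaultdict

def inverse_cardinality_stats(quads_sorted_by_opsc):
    """Single scan over (o,p,s,c)-sorted (s,p,o,c) quads: count distinct subjects
    per (p,o) and maintain the running average inverse-cardinality per predicate."""
    sums = defaultdict(int)     # p -> sum of inverse-cardinalities
    counts = defaultdict(int)   # p -> number of distinct (p,o) pairs
    for (o, p), group in groupby(quads_sorted_by_opsc, key=lambda q: (q[2], q[1])):
        distinct_subjects = len({s for (s, _, _, _) in group})
        sums[p] += distinct_subjects
        counts[p] += 1
    # averages can subsequently be adjusted with the credibility formula of Eq. (1)
    return {p: sums[p] / counts[p] for p in sums}
```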

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order and, for each (s, p) pair, for each set of unique objects $O_{ps}$ found thereon, and for each pair in

$\{(o_i, o_j) \in (U \cup B) \times (U \cup B) \mid o_i, o_j \in O_{ps}, o_i < o_j\}$

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

$(o_i, o_j, C(p, s), p, s, -)$

where $C(p, s) = \frac{1}{|O_{ps}| \times AAC(p)}$. We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects $S_{po}$, and for each pair in

$\{(s_i, s_j) \in (U \cup B) \times (U \cup B) \mid s_i, s_j \in S_{po}, s_i < s_j\}$

we output analogous sextuples of the form

$(s_i, s_j, IC(p, o), p, o, +)$

We call the sets $O_{ps}$ and their analogues $S_{po}$ concurrence classes, denoting sets of entities that share the given predicate–subject / predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.
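The following is a minimal sketch of how the raw sextuples could be emitted for a single concurrence class; the class members, coefficient and directionality marker are passed in, and the example values (ex:paper1, ex:aidan, etc.) are purely illustrative.

```python
from itertools import combinations

def concurrence_sextuples(predicate, anchor, members, coefficient, direction):
    """Emit raw concurrence sextuples for one concurrence class.
    `members` is the set O_ps (or S_po), `coefficient` is C(p, s)
    (or IC(p, o)), and `direction` is the '-' (or '+') marker."""
    for e_i, e_j in combinations(sorted(members), 2):  # e_i < e_j lexicographically
        yield (e_i, e_j, coefficient, predicate, anchor, direction)

# Hypothetical example: three objects sharing the (ex:aidan, foaf:made) pair.
O_ps = {"ex:paper1", "ex:paper2", "ex:paper3"}
aac_foaf_made = 5.0                       # assumed adjusted average cardinality
coeff = 1.0 / (len(O_ps) * aac_foaf_made) # C(p, s) as defined above
tuples = list(concurrence_sextuples("foaf:made", "ex:aidan", O_ps, coeff, "-"))
```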

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements; generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
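A sketch of this merge-join and aggregation step follows; the `acs` argument stands in for the ACS aggregation defined earlier, and the per-predicate handling of intra-property correlation is omitted for brevity.

```python
from itertools import groupby
import heapq

def aggregate_concurrences(subject_tuples, object_tuples, acs):
    """Merge two streams of sextuples, each already sorted by their first
    two elements, and aggregate a final concurrence score per (e_i, e_j)
    pair. Each sextuple is (e_i, e_j, coefficient, p, anchor, direction)."""
    merged = heapq.merge(subject_tuples, object_tuples,
                         key=lambda t: (t[0], t[1]))
    for (e_i, e_j), group in groupby(merged, key=lambda t: (t[0], t[1])):
        coefficients = [t[2] for t in group]
        yield (e_i, e_j, acs(coefficients))   # acs: placeholder for the ACS aggregation
```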

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets: for example, consider a corpus containing 1 million persons sharing the value female for the property foaf:gender, where we would have to generate $\frac{10^6 \times (10^6 - 1)}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., $|S_{po}|$) is a factor in the denominator of the computed C(p, o) score, such that $C(p, o) \leq \frac{1}{|S_{po}|}$. Thus, in practice, we implement a maximum-size threshold for the $S_{po}$ and $O_{ps}$ concurrence classes; this threshold is selected based on a practical upper limit for the raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purposes of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
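The threshold selection itself is a simple cumulative sum over the class-size distribution; the following sketch assumes a map from class size to the number of classes of that size, gathered alongside the predicate statistics.

```python
def max_class_size(class_size_counts, budget):
    """Pick the largest concurrence-class-size threshold such that the
    number of raw non-reflexive, non-symmetric tuples stays within
    `budget`. `class_size_counts` maps class size s -> number of classes
    of that size (an assumed, illustrative input)."""
    total, threshold = 0, 0
    for s in sorted(class_size_counts):
        tuples_for_s = class_size_counts[s] * (s * (s - 1) // 2)
        if total + tuples_for_s > budget:
            break
        total += tuples_for_s
        threshold = s
    return threshold

# e.g., a budget of ~1 billion raw tuples yielded a threshold of 38 in our setting
```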

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments, and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object; triples with rdf:type as predicate are not included in the object-hashing. Subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12
Breakdown of timing of distributed concurrence analysis.

Category                                     1                2                4                8               Full-8
                                             min      %       min      %       min      %       min      %      min      %
Master
  Total execution time                      516.2  100.0     262.7  100.0     137.8  100.0      73.0  100.0    835.4  100.0
  Executing                                   2.2    0.4       2.3    0.9       3.1    2.3       3.4    4.6      2.1    0.3
    Miscellaneous                             2.2    0.4       2.3    0.9       3.1    2.3       3.4    4.6      2.1    0.3
  Idle (waiting for slaves)                 514.0   99.6     260.4   99.1     134.6   97.7      69.6   95.4    833.3   99.7
Slave
  Avg. executing (total exc. idle)          514.0   99.6     257.8   98.1     131.1   95.1      66.4   91.0    786.6   94.2
    Split/sort/scatter (subject)            119.6   23.2      59.3   22.6      29.9   21.7      14.9   20.4    175.9   21.1
    Merge-sort (subject)                     42.1    8.2      20.7    7.9      11.2    8.1       5.7    7.9     62.4    7.5
    Split/sort/scatter (object)             115.4   22.4      56.5   21.5      28.8   20.9      14.3   19.6    164.0   19.6
    Merge-sort (object)                      37.2    7.2      18.7    7.1       9.6    7.0       5.1    7.0     55.6    6.6
    Extract high-level statistics            20.8    4.0      10.1    3.8       5.1    3.7       2.7    3.7     28.4    3.3
    Generate raw concurrence tuples          45.4    8.8      23.3    8.9      11.6    8.4       5.9    8.0     67.7    8.1
    Coordinate/sort concurrence tuples       97.6   18.9      50.1   19.1      25.0   18.1      12.5   17.2    166.0   19.9
    Merge-sort/aggregate similarity          36.0    7.0      19.2    7.3      10.0    7.2       5.3    7.2     66.6    8.0
  Avg. idle                                   2.2    0.4       4.9    1.9       6.7    4.9       6.5    9.0     48.8    5.8
    Waiting for peers                         0.0    0.0       2.6    1.0       3.6    2.6       3.2    4.3     46.7    5.6
    Waiting for master                        2.2    0.4       2.3    0.9       3.1    2.3       3.4    4.6      2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step; in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.
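A minimal sketch of the coordinate step follows; the choice of hash function and the wire format are assumptions for illustration only.

```python
import hashlib

def peer_for(term, num_slaves):
    """Pick the slave responsible for a term by hashing its string form
    and taking the result modulo the number of slaves."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slaves

def split_by_subject(quads, num_slaves):
    """Partition a local segment of quadruples into per-peer splits,
    keyed on the subject position; each split is then sorted and sent
    to its peer."""
    splits = [[] for _ in range(num_slaves)]
    for (s, p, o, c) in quads:
        splits[peer_for(s, num_slaves)].append((s, p, o, c))
    return splits
```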

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples and, given the distribution of class memberships, we assume that hashing these values would lead to an uneven distribution of data, and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power–law(-esque) relationship between the size of the concurrence class and the number of such classes observed.

[Figure 9: two log/log plots, (a) predicate–object pairs and (b) predicate–subject pairs; x-axis: concurrence class size, y-axis: count; series plotted: number of classes, concurrence output size, cumulative concurrence output size, and the chosen cut-off.]

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs respectively, where for each class size on the x-axis ($s_i$), we show the number of classes ($c_i$), the raw non-reflexive, non-symmetric concurrence tuples generated ($sp_i = c_i \frac{s_i^2 - s_i}{2}$), a cumulative count of tuples generated for increasing class sizes ($csp_i = \sum_{j \leq i} sp_j$), and our cut-off ($s_i$ = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

   Predicate                   Objects        Triples        AAC
1  foaf:nick                   150,433,035    150,437,864    1.000
2  lldpubmed:journal             6,790,426      6,795,285    1.003
3  rdf:value                     2,802,998      2,803,021    1.005
4  eurostat:geo                  2,642,034      2,642,034    1.005
5  eurostat:time                 2,641,532      2,641,532    1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

   Predicate                   Subjects       Triples        AAIC
1  lldpubmed:meshHeading         2,121,384      2,121,384    1.009
2  opiumfield:recommendation     1,920,992      1,920,992    1.010
3  fb:type.object.key            1,108,187      1,108,187    1.017
4  foaf:page                     1,702,704      1,712,970    1.017
5  skipinions:hasFeature         1,010,604      1,010,604    1.019


Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static (i.e., larger class sizes give quadratically more concurrences, but occur polynomially less often), until the point where the largest classes, which generally have only one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power–law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 19 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000), we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996); the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.³⁰ Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.³¹ We also encountered unusually large numbers of classes approximating these milestones, such as 73 at 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

³⁰ cf. http://api.hi5.com/rest/profile/foaf/100614697
³¹ cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for an inter-PLD pair: inter-PLD concurrences are thus both fewer in number and typically much weaker.³²

³² Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

Table 15
Top five concurrent entities and the number of pairs they share.

   Entity label 1    Entity label 2    Concur.
1  New York City     New York State    791
2  London            England           894
3  Tokyo             Japan             900
4  Toronto           Ontario           418
5  Philadelphia      Pennsylvania      217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

   Ranked entity        Con.    "Closest" entity     Val.
1  Tim Berners-Lee       908    Lalana Kagal         0.83
2  Dan Brickley         2552    Libby Miller         0.94
3  update.status.net      11    socialnetwork.ro     0.45
4  FOAF-a-matic           21    foaf.me              0.23
5  Evan Prodromou       3367    Stav Prodromou       0.89



In Table 15 we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both given as the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16 we give a description of the concurrent entities found for the top five ranked entities; for brevity, we again show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;³³ (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks, exported from the identi.ca service.

³³ Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.


7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 has led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.³⁴ In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

³⁴ We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

First, in order to track which data are consolidated and which not, in the previous consolidation step we output sextuples of the form

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position (as appropriate), and s′ and o′ denote the input identifiers prior to consolidation.³⁵

³⁵ We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,³⁶ and one directory gives 9 owl:AllDisjointClasses axioms,³⁷ and where we found no other OWL 2 axioms relevant to the rules in Table 4.

³⁶ http://models.okkam.org/ENS-core-vocabulary#country_of_residence
³⁷ http://ontologydesignpatterns.org/cp/owl/fsdas/

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .
?x ?p ?l1 , ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for ?l1 and ?l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
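The following is a minimal sketch of the relaxed comparison as we read it above; it approximates OWL data-value equality by syntactic equality, and the literal representation is an assumption for illustration.

```python
def relaxed_literal_clash(l1, l2):
    """Heuristic check for a functional (datatype) property with two
    literal values. Each literal is a (lexical_form, language_tag,
    datatype) triple; we flag a clash only if the literals differ and
    their lower-cased lexical forms (ignoring language-tags and
    datatypes) also differ."""
    (lex1, lang1, dt1), (lex2, lang2, dt2) = l1, l2
    if (lex1, lang1, dt1) == (lex2, lang2, dt2):
        return False                      # identical literals: no clash
    return lex1.lower() != lex2.lower()   # relaxed, case-insensitive comparison

# e.g., "Male"@en vs "male" -> no clash; "male" vs "female" -> clash
assert not relaxed_literal_clash(("Male", "en", None), ("male", None, None))
assert relaxed_literal_clash(("male", None, None), ("female", None, None))
```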

With respect to these consistency checking rules, we consider the terminological data to be sound.³⁸ We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

³⁸ In any case, we always source terminological data from the raw, unconsolidated corpus.

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"³⁹ and "Sergio Fernández Anzuola"⁴⁰, who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.⁴¹ The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

³⁹ http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
⁴⁰ http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
⁴¹ http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.⁴²

⁴² Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
:made owl:propertyDisjointWith :maker .

Assertional [fictional]
ex:AZ″ :maker ex:entcons′ .
dblp:Antoine_Zimmermann″ :made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); and (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → ℕ × ℝ is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C_1, ..., C_n} ⊆ 2^(U ∪ B), such that each C_i ∈ C, C_i ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq^0, each containing an individual identifier in the equivalence class. Now, let U(E_i, E_j) denote the aggregated weight of the edge considering the merge of the nodes of E_i and the nodes of E_j in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in E_i and all nodes in E_j, and such that c denotes the concurrence score considering the merge of entities in E_i and E_j; intuitively, the same weight as before, but applied as if the identifiers in E_i and E_j were consolidated in the graph. We can apply the following clustering:

– for each pair of sets E_i, E_j ∈ Eq^n such that there exists no pair {a, b} ∈ C with a ∈ E_i, b ∈ E_j (i.e., consistently mergeable subsets), identify the weights U(E_i, E_j) and order the pairings;

– in descending order with respect to the above weights, merge E_i, E_j pairs, such that neither E_i nor E_j have already been merged in this iteration, producing Eq^(n+1) at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.⁴³

⁴³ We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].
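A minimal sketch of the greedy clustering follows; the `weight` function (returning the lexicographically compared (d, c) pairs) and the set of incompatible pairs are assumed to be supplied by the caller, as described above.

```python
def repair_equivalence_class(identifiers, incompatible, weight):
    """Greedy consistent clustering. `identifiers` is the equivalence
    class, `incompatible` a set of frozensets {a, b} that must end up in
    different partitions, and `weight(E_i, E_j)` returns the (d, c) pair
    for merging two partitions."""
    partitions = [frozenset([i]) for i in identifiers]   # singleton seeds

    def clashes(a, b):
        return any(frozenset([x, y]) in incompatible for x in a for y in b)

    while True:
        # candidate merges between consistently mergeable partitions
        candidates = [(weight(a, b), a, b)
                      for i, a in enumerate(partitions)
                      for b in partitions[i + 1:]
                      if not clashes(a, b)]
        if not candidates:
            break
        candidates.sort(key=lambda t: t[0], reverse=True)  # strongest first
        merged, next_partitions = set(), []
        for _, a, b in candidates:
            if a not in merged and b not in merged:
                next_partitions.append(a | b)
                merged.update([a, b])
        next_partitions += [p for p in partitions if p not in merged]
        partitions = next_partitions
    return partitions
```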

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
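A small sketch of this edge-labelled equivalence graph (names are illustrative; the concurrence tie-breaking is omitted):

```python
from collections import defaultdict

class EquivalenceGraph:
    """Undirected equivalence graph whose edges carry the set of hashes
    of the input triples that justify them; the primary weight of an
    edge (or of a merge of node sets) is the number of unique hashes."""
    def __init__(self):
        self.edges = defaultdict(set)   # frozenset({u, v}) -> set of triple hashes

    def add_evidence(self, u, v, triple_hash):
        self.edges[frozenset((u, v))].add(triple_hash)

    def weight(self, u, v):
        return len(self.edges.get(frozenset((u, v)), set()))

    def merged_weight(self, nodes_a, nodes_b):
        """Primary weight of merging two node sets: union of the evidence
        hashes over all edges between them."""
        evidence = set()
        for u in nodes_a:
            for v in nodes_b:
                evidence |= self.edges.get(frozenset((u, v)), set())
        return len(evidence)
```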

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:


– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%); in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair; we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17 we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.

Table 17
Breakdown of timing of distributed disambiguation and repair.

Category                                     1                2                4                8               Full-8
                                             min      %       min      %       min      %       min      %      min      %
Master
  Total execution time                      114.7  100.0      61.3  100.0      36.7  100.0      23.8  100.0    141.1  100.0
  Executing                                   0.0    0.0       0.1    0.1       0.1    0.1       0.1    0.1      0.0    0.0
    Miscellaneous                             0.0    0.0       0.1    0.1       0.1    0.1       0.1    0.1      0.0    0.0
  Idle (waiting for slaves)                 114.7  100.0      61.2   99.9      36.6   99.9      23.8   99.9    141.1  100.0
Slave
  Avg. executing (total exc. idle)          114.7  100.0      59.0   96.3      33.6   91.5      22.3   93.7    130.9   92.8
    Identify inconsistencies and repairs     74.7   65.1      38.5   62.8      23.2   63.4      17.2   72.2     72.4   51.3
    Repair corpus                            40.0   34.8      20.5   33.5      10.3   28.1       5.1   21.5     58.5   41.5
  Avg. idle                                   0.0    0.0       2.3    3.7       3.1    8.5       1.5    6.3     10.2    7.2
    Waiting for peers                         0.0    0.0       2.2    3.6       3.1    8.4       1.5    6.2     10.1    7.2
    Waiting for master                        0.0    0.0       0.1    0.1       0.0    0.1       0.0    0.1      0.1    0.0

Table 20
Results of manual repair inspection.

Manual       Auto         Count     %
SAME         /            223       44.3
DIFFERENT    /            261       51.9
UNCLEAR      –             19        3.8
/            SAME         361       74.6
/            DIFFERENT    123       25.4
SAME         SAME         205       91.9
SAME         DIFFERENT     18        8.1
DIFFERENT    DIFFERENT    105       40.2
DIFFERENT    SAME         156       59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

   Functional property                 Detections
1  foaf:gender                         60
2  foaf:age                            32
3  dbo:height, dbo:length              4
4  dbo:wheelbase, dbo:width            3
5  atomowl:body                        1
6  loc:address, loc:name, loc:phone    1

Table 19
Breakdown of inconsistency detections for disjoint classes.

   Disjoint class 1      Disjoint class 2    Detections
1  foaf:Document         foaf:Person         163
2  foaf:Document         foaf:Agent          153
3  foaf:Organization     foaf:Person         17
4  foaf:Person           foaf:Project        3
5  foaf:Organization     foaf:Project        1

[Figure 10: log/log plot; x-axis: number of partitions, y-axis: number of repaired equivalence classes.]

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

[Figure 11: log/log plot; x-axis: number of IDs in partition, y-axis: number of partitions.]

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).


From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties with respect to such detections in Table 18, and the top five disjoint-class pairs in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 223 times (of which 49 were TRIVIALLY SAME) and option DIFFERENT 261 times.⁴⁴ We checked our manually annotated results against the partitions produced by our repair to see if they corresponded; the results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

⁴⁴ We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.⁴⁵ Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

⁴⁵ Of course, in OWL (2) DL, datatype properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log n); external sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines is small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;⁴⁶

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue, given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

⁴⁶ We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets), we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.⁴⁷ Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

⁴⁷ Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [474413512], etc., and surveys at [198]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name of instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets)

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing matching using pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris


to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
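To illustrate the first idea in isolation, the following toy Python sketch (our own illustration and not Song and Heflin's implementation; the property names, data and threshold are assumptions) groups instances into candidate sets only when they share a value for a discriminating property, i.e., a (property, value) key held by few instances:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(triples, max_group_size):
    """triples: (subject, property, value); only (property, value) keys shared
    by few subjects (discriminating keys) produce candidate pairs."""
    groups = defaultdict(set)
    for s, p, v in triples:
        groups[(p, v)].add(s)
    pairs = set()
    for members in groups.values():
        if 1 < len(members) <= max_group_size:  # skip non-discriminating keys
            pairs.update(combinations(sorted(members), 2))
    return pairs

data = [("ex:a", "ex:isbn", "978-0261102354"),
        ("ex:b", "ex:isbn", "978-0261102354"),
        ("ex:a", "ex:type", "ex:Book"),
        ("ex:b", "ex:type", "ex:Book"),
        ("ex:c", "ex:type", "ex:Book")]
print(candidate_pairs(data, max_group_size=2))  # {('ex:a', 'ex:b')}

Here the common ex:type key groups too many subjects to be discriminating, so only the shared ISBN value yields a candidate pair for further (possibly expensive) matching.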

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence-class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We


have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec/
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccá (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on The Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Table 5
Breakdown of timing of distributed baseline consolidation.

Category                         1 (min / %)    2 (min / %)    4 (min / %)    8 (min / %)    Full-8 (min / %)

Master
  Total execution time           42.2 / 100.0   22.3 / 100.0   11.9 / 100.0    6.7 / 100.0   63.3 / 100.0
  Executing                       0.5 /   1.1    0.5 /   2.2    0.5 /   4.0    0.5 /   7.5    8.5 /  13.4
    Aggregating owl:sameAs        0.4 /   1.1    0.5 /   2.2    0.5 /   4.0    0.5 /   7.5    8.4 /  13.3
    Miscellaneous                 0.1 /   0.1    0.1 /   0.4     ∼  /    ∼      ∼  /    ∼     0.1 /   0.1
  Idle (waiting for slaves)      41.8 /  98.9   21.8 /  97.7   11.4 /  95.8    6.2 /  92.1   54.8 /  86.6

Slaves
  Avg. Executing (total)         41.8 /  98.9   21.8 /  97.7   11.4 /  95.8    6.2 /  92.1   53.5 /  84.6
    Extract owl:sameAs            9.0 /  21.4    4.5 /  20.2    2.2 /  18.4    1.1 /  16.8   12.3 /  19.5
    Consolidate                  32.7 /  77.5   17.1 /  76.9    8.8 /  73.5    4.7 /  70.5   41.2 /  65.1
  Avg. Idle                       0.5 /   1.1    0.7 /   2.9    1.0 /   8.2    0.8 /  12.7    9.8 /  15.4
    Waiting for peers             0.0 /   0.0    0.1 /   0.7    0.5 /   4.0    0.3 /   4.7    1.3 /   2.0
    Waiting for master            0.5 /   1.1    0.5 /   2.3    0.5 /   4.2    0.5 /   7.9    8.5 /  13.4

Fig. 3. Distribution of sizes of equivalence classes on log/log scale (from [32]); x-axis: equivalence class size, y-axis: number of classes.

Fig. 4. Distribution of the number of PLDs per equivalence class on log/log scale; x-axis: number of PLDs in equivalence class, y-axis: number of classes.

12 For example, see the coreference results given by http://www.rkbexplorer.com/sameAs/?uri=http://www.acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e, which at the time of writing correspond to 36 out of the 443 equivalent URIs found for Dieter Fensel.


100 million quadruple corpus. Note that in the table we use the symbol ∼ to indicate a negligible value (0 < ∼ < 0.05). We see that as the number of slave machines increases, so too does the percentage of time taken to aggregate global knowledge on the master machine (the absolute time stays roughly stable). This indicates that, for a high number of machines, the aggregation of global knowledge will eventually become a bottleneck, and an alternative distribution strategy may be preferable, subject to further investigation. Overall, however, we see that the total execution times roughly halve when the number of slave machines are doubled: we observe a 0.528, 0.536 and 0.561 reduction in total task execution time moving from 1 slave machine to 2, 2 to 4, and 4 to 8, respectively (here, a value of 0.5 would indicate linear scaling out).

4.4. Results evaluation

We extracted 11.93 million raw owl:sameAs quadruples from the full corpus. Of these, however, there were only 3.77 million unique triples. After closure, the data formed 2.16 million equivalence classes mentioning 5.75 million terms (6.24% of URIs): an average of 2.65 elements per equivalence class. Of the 5.75 million terms, only 4,156 were blank-nodes. Fig. 3 presents the distribution of sizes of the equivalence classes, where the largest equivalence class contains 8,481 equivalent entities and 1.6 million (74.1%) equivalence classes contain the minimum two equivalent identifiers. Fig. 4 shows a similar distribution, but for the number of PLDs extracted from URIs in each equivalence class. Interestingly, the majority (57.1%) of equivalence classes contained identifiers from more than one PLD, where the most diverse equivalence class contained URIs from 32 PLDs.

Table 6 shows the canonical URIs for the largest five equivalence classes; we manually inspected the results and show whether or not the results were verified as correct/incorrect. Indeed, results for class 1 and 2 were deemed incorrect due to over-use of owl:sameAs for linking drug-related entities in the DailyMed and LinkedCT exporters. Results 3 and 5 were verified as correct consolidation of prominent Semantic Web related authors, resp., Dieter Fensel and Rudi Studer; authors are given many duplicate URIs by the RKBExplorer coreference index.12

Result 4 contained URIs from various sites, generally referring to the United States, mostly from DBPedia and LastFM. With respect to the DBPedia URIs, these (i) were equivalent but for capitalisation variations or stop-words, (ii) were variations of abbreviations or valid synonyms, (iii) were different language versions (e.g., dbpedia:Etats_Unis), (iv) were nicknames (e.g., dbpedia:Yankee_land), (v) were related but not equivalent (e.g., dbpedia:American_Civilization), or (vi) were just noise (e.g., dbpedia:LOL_Dean).

Besides the largest equivalence classes, which we have seen are prone to errors, perhaps due to the snowballing effect of the transitive and symmetric closure, we also manually evaluated a random sample of results. For the sampling, we take the closed


Table 6
Largest five equivalence classes (from [32]).

   Canonical term (lexically first in equivalence class)                              Size   Correct
1  http://bio2rdf.org/dailymed_drugs:1000                                             8481
2  http://bio2rdf.org/dailymed_drugs:1042                                              800
3  http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e         443   ✓
4  http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1               353   ✓
5  http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9    316   ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are in fact equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR; in terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%), and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with number of equivalence classes they co-occur in.

    PLD              PLD                      Co-occur
1   rdfize.com       uriburner.com            506,623
2   dbpedia.org      freebase.com             187,054
3   bio2rdf.org      purl.org                 185,392
4   loc.gov          info:lc/authorities(a)   166,650
5   l3s.de           rkbexplorer.com          125,842
6   bibsonomy.org    l3s.de                   125,832
7   bibsonomy.org    rkbexplorer.com          125,830
8   dbpedia.org      mpii.de                   99,827
9   freebase.com     mpii.de                   89,610
10  dbpedia.org      umbel.org                 65,565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities: in particular, using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features, including functional properties, inverse-functional properties and certain cardinality restrictions, that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes, such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc., to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email-addresses, personal homepages, Wikipedia articles, etc.

Although Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in-memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise, on-disk methods for rewriting data.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations: the scalability of the baseline approach is limited by the machine with the least in-memory capacity.


Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data; hence, we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class restricted version of the functional-property inferencing).16

Thus, we look at using OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.
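For illustration only (the authoritative formulations are those of Table 2 and the OWL 2 RL/RDF specification), the inverse-functional-property rule prp-ifp has roughly the following shape, reading the body as a conjunction of triple patterns:

?p rdf:type owl:InverseFunctionalProperty . ?x1 ?p ?y . ?x2 ?p ?y
  ⇒ ?x1 owl:sameAs ?x2

The other consolidation rules share this structure: a terminological pattern (the property or class typing) plus two assertional patterns that must be joined on a common value.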

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins, and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

In Step (i), we extract terminological data, required for application of our rules, from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan: e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf

we would generate a T-ground rule

?x homepage ?y ⇒ ?x isPrimaryTopicOf ?y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns such as

?x isPrimaryTopicOf ?y

which may indicate data useful for consolidation (note that here, isPrimaryTopicOf is assumed to be inverse-functional).
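The following minimal Python sketch, written purely for illustration (the property names, the tiny in-memory corpus and the dictionary-based rule representation are our own assumptions, not the paper's implementation), shows the single-scan pattern used by such T-ground rules: each quad is processed independently, and any input or inferred statement matching a consolidation-relevant predicate is buffered separately for the later join computation:

# T-ground general rules: predicate -> predicates to additionally infer,
# e.g., compiled from "homepage rdfs:subPropertyOf isPrimaryTopicOf"
GENERAL_RULES = {"homepage": ["isPrimaryTopicOf"]}

# assertional patterns useful for consolidation (e.g., inverse-functional
# properties and owl:sameAs itself)
CONSOLIDATION_PREDICATES = {"isPrimaryTopicOf", "owl:sameAs"}

def single_scan(quads, rules, consolidation_preds):
    """One pass over the corpus: no joins are needed since every rule has a
    single assertional pattern; returns (closure, consolidation_buffer)."""
    closure, buffer = [], []
    for s, p, o, ctx in quads:
        derived = [(s, p, o, ctx)] + [(s, p2, o, ctx) for p2 in rules.get(p, [])]
        closure.extend(derived)
        # in the on-disk setting this buffer is written to a separate file
        buffer.extend(q for q in derived if q[1] in consolidation_preds)
    return closure, buffer

corpus = [("ex:aidan", "homepage", "ex:home", "ex:doc1")]
closure, consolidation_buffer = single_scan(corpus, GENERAL_RULES,
                                            CONSOLIDATION_PREDICATES)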

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern, including any owl:sameAs statements found, are buffered to a file for later join computation.

Subsequently, in Step (iv), we must now compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].
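As a rough illustration of the latter two primitives (again a Python sketch under simplifying assumptions: tuples are held as small in-memory Python tuples rather than G-Zipped N-Triple-like files, and heapq.merge stands in for merging sorted on-disk runs), an external sort and a merge-join over key-sorted inputs can be written as follows:

import heapq
import itertools

def external_sort(tuples, batch_size, key):
    """Sort fixed-size batches in memory, then merge the sorted runs
    (each run would be written to and re-read from disk in practice)."""
    runs, batch = [], []
    for t in tuples:
        batch.append(t)
        if len(batch) == batch_size:
            runs.append(sorted(batch, key=key))
            batch = []
    if batch:
        runs.append(sorted(batch, key=key))
    return heapq.merge(*runs, key=key)

def merge_join(left, right, key):
    """Interleaved scan of two inputs sorted on the same key; an in-memory
    join is applied for each individual join element."""
    lgroups = itertools.groupby(left, key=key)
    rgroups = itertools.groupby(right, key=key)
    lk, lg = next(lgroups, (None, None))
    rk, rg = next(rgroups, (None, None))
    while lg is not None and rg is not None:
        if lk == rk:
            for a, b in itertools.product(list(lg), list(rg)):
                yield (lk, a, b)
            lk, lg = next(lgroups, (None, None))
            rk, rg = next(rgroups, (None, None))
        elif lk < rk:
            lk, lg = next(lgroups, (None, None))
        else:
            rk, rg = next(rgroups, (None, None))

# example: join two subject-sorted lists of (subject, object) pairs
left = external_sort([("s2", "b"), ("s1", "a")], 1, key=lambda t: t[0])
right = iter(sorted([("s1", "x"), ("s3", "y")], key=lambda t: t[0]))
print(list(merge_join(left, right, key=lambda t: t[0])))
# [('s1', ('s1', 'a'), ('s1', 'x'))]

The point of the on-disk variants is the same: only one sorted group per join key needs to be held in memory at any time.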

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data

18 These files are G-Zip compressed flat files of N-Triple-like syntax, encoding arbitrary length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.


according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.
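As an illustration of the inverse-functional case, the sketch below (our own simplification under stated assumptions, not the paper's code) scans consolidation-relevant statements sorted by (predicate, object) and emits raw owl:sameAs pairs for subjects that share a value of an inverse-functional property; the input layout and set of IFPs are assumed.

```python
from itertools import groupby

def ifp_sameas(triples_sorted_by_po, ifps):
    """triples_sorted_by_po: iterable of (s, p, o) already sorted by (p, o).
    ifps: set of properties declared owl:InverseFunctionalProperty.
    Yields raw owl:sameAs pairs (s1, s2) for subjects sharing a (p, o) key."""
    relevant = ((s, p, o) for s, p, o in triples_sorted_by_po if p in ifps)
    for (p, o), group in groupby(relevant, key=lambda t: (t[1], t[2])):
        subjects = sorted({s for s, _, _ in group})
        # link every subject to the lexically lowest one; this is sufficient
        # input for the later symmetric/transitive closure
        for s in subjects[1:]:
            yield subjects[0], s
```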

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method, again leveraging external sorts and merge-joins to perform the computation (here we only sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;

(ii) transitivity is supported by iterative merge-join scans:

20. One could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);

– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored, and dropped non-canonical statements are removed;

– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;

– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;

(iv) the process of reaching a fixpoint is accelerated using available main memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with the on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
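The batching idea can be pictured with a small union-find sketch (an assumption on our part; the paper's Algorithm 1 is not reproduced here): equivalences are merged in memory until a capacity limit is hit, then flushed as canonicalised pairs keyed on the lexically lowest member.

```python
class EquivalenceBatch:
    """Accumulate owl:sameAs pairs in memory; flush canonicalised pairs
    (non_canonical, canonical) once too many terms are held."""
    def __init__(self, capacity=1_000_000):
        self.parent = {}
        self.capacity = capacity

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            # plain lexical order stands in for the document's ordering
            # (URIs lexically lower than blank nodes); the lowest term is canonical
            lo, hi = sorted((ra, rb))
            self.parent[hi] = lo
        return len(self.parent) >= self.capacity  # True => caller should flush

    def flush(self):
        pairs = [(t, self._find(t)) for t in self.parent if self._find(t) != t]
        self.parent.clear()
        return pairs  # to be merged with the on-disk closure
```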

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and that s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.
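A sketch of such a rewriting pass is given below (again only illustrative): the corpus sorted on the position to rewrite is merge-joined against the pruned owl:sameAs index sorted on its non-canonical term.

```python
def rewrite_position(sorted_triples, sameas_sorted, pos):
    """sorted_triples: (s, p, o) tuples sorted on position pos (0=subject, 2=object).
    sameas_sorted: (non_canonical, canonical) pairs sorted on the first element.
    Yields triples with position pos rewritten to the canonical identifier."""
    index = iter(sameas_sorted)
    cur = next(index, None)
    for t in sorted_triples:
        key = t[pos]
        # advance the equality index until it reaches (or passes) the key
        while cur is not None and cur[0] < key:
            cur = next(index, None)
        if cur is not None and cur[0] == key:
            t = t[:pos] + (cur[1],) + t[pos + 1:]
        yield t
```

Rewriting both positions then corresponds to calling this once on data sorted by subject, re-sorting the output by object, and calling it again with `pos=2`, as described above.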

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cax-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
<http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs relations, we require a higher-level iterative process as follows:

(i) initially apply the consolidation rules and append the results to a file alongside the asserted owl:sameAs statements found;

(ii) apply the initial closure of the owl:sameAs data;

(iii) then, iteratively until no new owl:sameAs inferences are found:

– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;

– derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;

– re-derive the closure of the owl:sameAs data including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.
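To show how the higher-level iteration plays out, the following self-contained toy sketch (our own illustration; names, prefixes and helper functions are assumptions, and everything is kept in memory rather than on disk) iterates a single inverse-functional-property rule to a fixpoint, reproducing the dblp:Axel / axel:me example above.

```python
def close_sameas(pairs):
    """Symmetric/transitive closure with canonicalisation: map each term to
    the lexically lowest member of its equivalence class (in-memory toy)."""
    canon = {}
    def find(x):
        canon.setdefault(x, x)
        while canon[x] != x:
            x = canon[x]
        return x
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            lo, hi = sorted((ra, rb))
            canon[hi] = lo
    return {t: find(t) for t in canon}

def consolidate(triples, ifps, asserted_sameas):
    sameas = set(asserted_sameas)
    canon = close_sameas(sameas)
    while True:
        # rewrite the join (object) position of IFP statements, then re-derive
        rewritten = {(s, p, canon.get(o, o)) for s, p, o in triples if p in ifps}
        by_value = {}
        for s, p, o in rewritten:
            by_value.setdefault((p, o), set()).add(canon.get(s, s))
        new = {tuple(sorted((a, b))) for group in by_value.values()
               for a in group for b in group if a != b}
        if new <= sameas:
            return close_sameas(sameas)
        sameas |= new
        canon = close_sameas(sameas)

# toy run mirroring the example above (prefixes are illustrative shorthand):
triples = {("dblp:Axel", "foaf:isPrimaryTopicOf", "http://polleres.net"),
           ("axel:me", "foaf:isPrimaryTopicOf", "http://axel.deri.ie")}
canon = consolidate(triples, {"foaf:isPrimaryTopicOf"},
                    {("http://polleres.net", "http://axel.deri.ie")})
assert canon["dblp:Axel"] == canon["axel:me"]
```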

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;

(ii) Gather: gather terminological data onto the master machine and thereafter bind the terminological patterns of the general/consolidation rules;

(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;

(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;

(v) Gather: gather all consolidation-relevant statements onto the master machine; then, in parallel:


Table 8
Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently.

Category                                      1             2             4             8             Full-8
                                              min     %     min     %     min     %     min     %     min     %
Master
  Total execution time                        454.3  100.0  230.2  100.0  127.5  100.0   81.3  100.0  740.4  100.0
  Executing                                    29.9    6.6   30.3   13.1   32.5   25.5   30.1   37.0  243.6   32.9
    Aggregate consolidation-relevant data       4.3    0.9    4.5    2.0    4.7    3.7    4.6    5.7   29.9    4.0
    Closing owl:sameAs*                        24.4    5.4   24.5   10.6   25.4   19.9   24.4   30.0  211.2   28.5
    Miscellaneous                               1.2    0.3    1.3    0.5    2.4    1.9    1.1    1.3    2.5    0.3
  Idle (waiting for slaves)                   424.5   93.4  199.9   86.9   95.0   74.5   51.2   63.0  496.8   67.1
Slave
  Avg. executing (total)                      448.9   98.8  221.2   96.1  111.6   87.5   56.5   69.4  599.1   80.9
    Extract terminology                        44.4    9.8   22.7    9.9   11.8    9.2    6.9    8.4   49.4    6.7
    Extract consolidation-relevant data        95.5   21.0   48.3   21.0   24.3   19.1   13.2   16.2  137.9   18.6
    Initial sort (by subject)*                 98.0   21.6   48.3   21.0   23.7   18.6   11.6   14.2  133.8   18.1
    Consolidation                             211.0   46.4  101.9   44.3   51.9   40.7   24.9   30.6  278.0   37.5
  Avg. idle                                     5.5    1.2    8.9    3.9   37.9   29.7   23.6   29.0  141.3   19.1
    Waiting for peers                           0.0    0.0    3.2    1.4    7.1    5.6    6.3    7.7   37.5    5.1
    Waiting for master                          5.5    1.2    5.8    2.5   30.8   24.1   17.3   21.2  103.8   14.0

– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;

– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);

21. At least in terms of pure quantity; however, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;

(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible – the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time – of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup – in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of the owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regard to performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by four machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3), as motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements – this time extracted directly from explicit statements, as opposed to explicit and inferred statements – which took 15.4 min (11% of the time taken when including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.


Table 9
Top ten most frequently occurring blacklisted values.

     Blacklisted term                                    Occurrences
 1   Empty literals                                      584735
 2   <http://null>                                       414088
 3   <http://www.vox.com/gone>                           150402
 4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"          58338
 5   <http://www.facebook.com>                           6988
 6   <http://facebook.com>                               5462
 7   <http://www.google.com>                             2234
 8   <http://www.facebook.com>                           1254
 9   <http://google.com>                                 1108
10   <http://null.com>                                   542

Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); x-axis: equivalence class size, y-axis: number of classes.


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of one. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc. However, these particular axioms would only affect consolidation for literals; thus, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted – however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top ten such values encountered for our corpus in Table 9 – this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID" properties). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.

During the computation of the owl:sameAs closure, we found zero inferences through the cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22. cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23. Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24. This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2, and s2 is a canonical identifier) contained 12.03 million statements.

From this data we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline – 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× from baseline – 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline – 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size two drop to 63.1% (from 74.1% in baseline).

In Table 10 we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user – labelled Team Vox – commonly appearing in user-FOAF exports on the Vox blogging platform. Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford – labelled IBRG – where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time we labelled 823 pairs (82.3%) as SAME, 145 (14.5%) as TRIVIALLY SAME, 23 (2.3%) as DIFFERENT and 9 (0.9%) as UNCLEAR. This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

    Canonical term (lexically lowest in equivalence class)             Size    Correct?
1   _:bnode37 (http://a12iggymom.vox.com/profile/foaf.rdf)             33052   yes
2   http://bio2rdf.org/dailymed_drugs:1000                             8481    –
3   http://ajft.org/rdf/foaf.rdf#_me                                    8140    yes
4   _:bnode4 (http://174.129.12.140:8080/tcm/data/association/100)     4395    yes
5   _:bnode1 (http://aaa977.vox.com/profile/foaf.rdf)                  1977    yes

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log); x-axis: number of PLDs in equivalence class, y-axis: number of classes.

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

    Canonical identifier                              BL    R
1   <http://dblp.l3s.de/Tim_Berners-Lee>              26    50
2   <genid:danbri>                                    10    138
3   <http://update.status.net>                        0     0
4   <http://www.ldodds.com/foaf/foaf-a-matic>         0     0
5   <http://update.status.net/user/1#acct>            0     6

Fig. 7. Distribution of the number of PLDs that terms are referenced by, for the raw, baseline consolidated and reasoning consolidated data (log/log); x-axis: number of PLDs mentioning a blank-node/URI term, y-axis: number of terms.


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers

25. We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley" – one of the founders and leading contributors of the FOAF project – with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11 we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7 we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26. Domenico Gendarmi, with three URIs – one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity – in terms of unique PLDs – was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes; (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities – with higher concurrence scores – are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but that shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27. We believe this to be due to FOAF publishing practices whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities – if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs – pairs that are typically shared by few entities; for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities – a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs – e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers) – where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between the three colleagues ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence
First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p,s), is the cardinality of the set {o ∈ C | (s,p,o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p,o), is the cardinality of the set {s ∈ U ∪ B | (s,p,o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting Gex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph – e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogously for deriving concurrence scores – note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality) – we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

\[ \mathit{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|} \]

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is by definition greater-than or equal-to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties – in their respective directions – as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values for properties with a low number of observations towards a global mean – consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula as follows to weight properties with few observations towards a global mean.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

\[ \mathit{AAIC}_G(p) = \frac{\mathit{AIC}_G(p)\cdot|G_p| + \overline{\mathit{AIC}_G}\cdot\overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1) \]

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), \(\overline{\mathit{AIC}_G}\) is the average inverse-cardinality for all predicate–object pairs (formally, \(\overline{\mathit{AIC}_G} = \frac{|G|}{|\{(p,o)\,\mid\,\exists s : (s,p,o)\in G\}|}\)), and \(\overline{|G|}\) is the average number of distinct objects for all predicates in the graph (formally, \(\overline{|G|} = \frac{|\{(p,o)\,\mid\,\exists s : (s,p,o)\in G\}|}{|\{p\,\mid\,\exists s\,\exists o : (s,p,o)\in G\}|}\)).
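As a concrete (assumed) reading of Definitions 3 and 4, the following sketch computes AIC and AAIC from an in-memory list of triples; on the full corpus the same counts are instead accumulated from a single scan of data sorted by (o, p, s), as described in Section 6.1.2. All names are illustrative.

```python
from collections import defaultdict

def inverse_cardinality_stats(triples):
    """triples: iterable of (s, p, o). Returns {p: (AIC_G(p), AAIC_G(p))}."""
    distinct_subjects = defaultdict(set)   # (p, o) -> {s}
    for s, p, o in triples:
        distinct_subjects[(p, o)].add(s)

    triples_per_pred = defaultdict(int)    # |{(s,o) : (s,p,o) in G}|
    objects_per_pred = defaultdict(int)    # |{o : exists s, (s,p,o) in G}|
    for (p, o), subs in distinct_subjects.items():
        triples_per_pred[p] += len(subs)
        objects_per_pred[p] += 1

    n_triples = sum(triples_per_pred.values())
    n_pairs = len(distinct_subjects)
    mean_aic = n_triples / n_pairs                 # mean AIC over all (p,o) pairs
    mean_objects = n_pairs / len(triples_per_pred) # mean distinct objects per predicate

    stats = {}
    for p in triples_per_pred:
        aic = triples_per_pred[p] / objects_per_pred[p]
        aaic = (aic * objects_per_pred[p] + mean_aic * mean_objects) \
               / (objects_per_pred[p] + mean_objects)     # Eq. (1)
        stats[p] = (aic, aaic)
    return stats

# the running example: AIC of foaf:gender is 1.5
Gex = [("ex:Alice", "foaf:gender", "female"),
       ("ex:Bob", "foaf:gender", "male"),
       ("ex:Claire", "foaf:gender", "female")]
print(inverse_cardinality_stats(Gex)["foaf:gender"][0])  # 1.5
```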

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that \(\mathit{AIC}_G(p)\cdot|G_p| = |\{(s,o)\mid (s,p,o)\in G\}|\) also denotes the number of triples for which p appears as a predicate in graph G, and that \(\overline{\mathit{AIC}_G}\cdot\overline{|G|} = \frac{|G|}{|\{p\,\mid\,\exists s\,\exists o : (s,p,o)\in G\}|}\) also denotes the average number of triples per predicate.

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]). (Depicted: sw:person/stefan-decker and sw:person/andreas-harth, sharing papers via foaf:made/dc:creator/swrc:author/foaf:maker, organisations via swrc:affiliation/foaf:member, and foaf:based_near dbpedia:Ireland.)

We maintain Eq. (1) in the given unreduced form as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (\(\overline{\mathit{AIC}_G}\)) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (\(\overline{|G|}\)).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company – e.g., http://google.com – should not gain as much concurrence as entities that share a value referential to a smaller company – e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as

\[ C_G(p,s) = \frac{1}{\mathit{Card}_G(p,s)\cdot \mathit{AAC}_G(p)} \]

and the concurrence coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as

\[ \mathit{IC}_G(p,o) = \frac{1}{\mathit{ICard}_G(p,o)\cdot \mathit{AAIC}_G(p)} \]

Again, please note that these coefficients fall into the interval ]0,1], since the denominator is, by definition, greater than or equal to one.

To take an example, let p_wh = foaf:workplaceHomepage and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2,000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2000; now IC_G(p_wh, o_g) = 1/(2000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28. Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.
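The worked figures above can be checked directly; the small sketch below simply re-evaluates Definition 5 for the two hypothetical values (all inputs are the assumed numbers from the example, not measurements).

```python
def ic(icard, aaic):
    """Concurrence coefficient of a predicate-object pair (Definition 5)."""
    return 1.0 / (icard * aaic)

aaic_wh = 7                   # assumed AAIC of foaf:workplaceHomepage
print(ic(2000, aaic_wh))      # ~0.00007  (http://google.com, widely shared)
print(ic(100, aaic_wh))       # ~0.00143  (http://deri.ie, more exclusive)
```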

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z1, ..., zn) be a tuple such that for each i = 1..n, z_i ∈ ]0,1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1..n, ACS_k = z_k + ACS_{k−1} − z_k · ACS_{k−1}.

The computation of the ACS value is the same process as determining the probability of at least one of two independent events occurring – P(A ∨ B) = P(A) + P(B) − P(A)·P(B) – which is by definition commutative and associative, and thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
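Definition 6 translates directly into a short fold; the following is a minimal sketch:

```python
def acs(coefficients):
    """Aggregated concurrence score: probabilistic (noisy-OR style) combination
    of coefficients in ]0,1]; order-independent by commutativity/associativity."""
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

print(acs([0.5, 0.5]))       # 0.75
print(acs([0.5, 0.5, 0.5]))  # 0.875: each additional shared pair adds less
```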

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers who have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate


edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples Z = {Z_p1, ..., Z_pn}, where each element Z_pi represents the tuple of (non-pruned) coefficients generated for the predicate p_i. We then aggregate this bag as follows:

\[ ACS(\mathcal{Z}) = ACS\Big(\big(ACS(Z_{p_i}) \cdot AA[I]C(p_i)\big)_{Z_{p_i} \in \mathcal{Z}}\Big) \]

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set as its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.
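A sketch of the two-level aggregation is given below; it follows our reading of the formula above (an inner ACS per predicate, scaled by that predicate's AA[I]C bound), with names and data layout assumed rather than taken from the authors' code.

```python
def acs(zs):
    """Aggregated concurrence score, as in the previous sketch."""
    score = 0.0
    for z in zs:
        score = z + score - z * score
    return score

def grouped_acs(shared_pairs, aa_ic):
    """shared_pairs: {predicate: [coefficients, one per shared value of that predicate]}.
    aa_ic: {predicate: adjusted average (inverse-)cardinality for that predicate}.
    Applies a per-predicate inner aggregation before the overall combination,
    dampening intra-property correlation (e.g., many co-authored papers)."""
    per_predicate = [acs(zs) * aa_ic[p] for p, zs in shared_pairs.items()]
    return acs(per_predicate)
```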

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis
We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory – if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c) and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s,p) pair, for each set of unique objects O_ps found thereon, and for each pair in

{(o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_ps, o_i < o_j}

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p,s), p, s, −)

where C(p,s) = 1/(|O_ps| · AAC(p)). We apply the same process for the other direction: for each (p,o) pair, for each set of unique subjects S_po, and for each pair in

{(s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_po, s_i < s_j}

we output analogous sextuples of the form

(s_i, s_j, IC(p,o), p, o, +)

We call the sets O_ps and their analogues S_po concurrence classes, denoting sets of entities that share the given predicate–subject / predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end, so as to forego duplication of intermediary processing.

29. For brevity, we omit the graph subscript.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements – generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets – for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate \(\frac{10^{6\cdot 2} - 10^6}{2} \approx 500\) billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set – e.g., |S_po| – is a factor in the denominator of the computed IC(p,o) score, such that IC(p,o) ⩽ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for the raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38 – we will see more in Section 6.4.
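The sextuple-generation step might be sketched as follows (a simplification under stated assumptions: data already grouped by (p, o), statistics precomputed, and a class-size threshold as above; names are illustrative):

```python
def concurrence_tuples_by_object(groups, aaic, max_class_size=38):
    """groups: iterable of ((p, o), [distinct subjects]), e.g. from data sorted by (o, p, s).
    aaic: {p: AAIC_G(p)}. Yields raw sextuples (e_i, e_j, coefficient, p, o, '+'),
    skipping concurrence classes above the size threshold."""
    for (p, o), subjects in groups:
        size = len(subjects)
        if size < 2 or size > max_class_size:
            continue
        coeff = 1.0 / (size * aaic[p])   # IC_G(p, o), since ICard_G(p, o) = |S_po|
        subjects = sorted(subjects)      # enforce e_i < e_j (lexicographical order)
        for i, ei in enumerate(subjects):
            for ej in subjects[i + 1:]:
                yield (ei, ej, coeff, p, o, "+")
```

The predicate–subject direction is handled analogously, emitting '−'-marked tuples with C_G(p, s).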

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments;

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object – triples with rdf:type as predicate are not included in the object-hashing; subsequently, the slaves merge-sort the segments ordered by object;

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold;

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders;

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before;

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step and produce the final non-symmetric concurrence tuples;

Table 12
Breakdown of timing of distributed concurrence analysis.

Category                                      1             2             4             8             Full-8
                                              min     %     min     %     min     %     min     %     min     %
Master
  Total execution time                        516.2  100.0  262.7  100.0  137.8  100.0   73.0  100.0  835.4  100.0
  Executing                                     2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
    Miscellaneous                               2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3
  Idle (waiting for slaves)                   514.0   99.6  260.4   99.1  134.6   97.7   69.6   95.4  833.3   99.7
Slave
  Avg. executing (total, excl. idle)          514.0   99.6  257.8   98.1  131.1   95.1   66.4   91.0  786.6   94.2
    Split/sort/scatter (subject)              119.6   23.2   59.3   22.6   29.9   21.7   14.9   20.4  175.9   21.1
    Merge-sort (subject)                       42.1    8.2   20.7    7.9   11.2    8.1    5.7    7.9   62.4    7.5
    Split/sort/scatter (object)               115.4   22.4   56.5   21.5   28.8   20.9   14.3   19.6  164.0   19.6
    Merge-sort (object)                        37.2    7.2   18.7    7.1    9.6    7.0    5.1    7.0   55.6    6.6
    Extract high-level statistics              20.8    4.0   10.1    3.8    5.1    3.7    2.7    3.7   28.4    3.3
    Generate raw concurrence tuples            45.4    8.8   23.3    8.9   11.6    8.4    5.9    8.0   67.7    8.1
    Coordinate/sort concurrence tuples         97.6   18.9   50.1   19.1   25.0   18.1   12.5   17.2  166.0   19.9
    Merge-sort/aggregate similarity            36.0    7.0   19.2    7.3   10.0    7.2    5.3    7.2   66.6    8.0
  Avg. idle                                     2.2    0.4    4.9    1.9    6.7    4.9    6.5    9.0   48.8    5.8
    Waiting for peers                           0.0    0.0    2.6    1.0    3.6    2.6    3.2    4.3   46.7    5.6
    Waiting for master                          2.2    0.4    2.3    0.9    3.1    2.3    3.4    4.6    2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step – in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and, given the distribution of class memberships, we assume that hashing these values would lead to an uneven distribution of data and subsequently uneven load balancing – e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970 – even for classes with two members, the potential effect on concurrence is negligible.
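The coordinate operation can be pictured as a simple modulo-hash partitioner (an illustrative sketch; the actual split/scatter machinery and term encoding are not shown), including the special case of not object-hashing rdf:type triples:

```python
import zlib

def target_slave(triple, num_slaves, hash_position):
    """Pick the slave responsible for a triple when coordinating on the given
    position (0 = subject, 2 = object)."""
    s, p, o = triple
    # "rdf:type" stands in for the full IRI; such triples are not object-hashed
    # because of the heavily skewed class-membership distribution
    if hash_position == 2 and p == "rdf:type":
        return None
    key = triple[hash_position].encode("utf-8")
    return zlib.crc32(key) % num_slaves
```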

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regard to application over the full corpus: although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal; a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks – again, interleaving a job for another task would have practical benefits.

Second, with regard to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static – i.e., larger class sizes give

Fig. 9. Breakdown of potential concurrence classes given with respect to (a) predicate–object pairs and (b) predicate–subject pairs, respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i · (s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

  # | Predicate          | Objects     | Triples     | AAC
  1 | foaf:nick          | 150,433,035 | 150,437,864 | 1.000
  2 | lldpubmed:journal  | 6,790,426   | 6,795,285   | 1.003
  3 | rdf:value          | 2,802,998   | 2,803,021   | 1.005
  4 | eurostat:geo       | 2,642,034   | 2,642,034   | 1.005
  5 | eurostat:time      | 2,641,532   | 2,641,532   | 1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  # | Predicate                  | Subjects  | Triples   | AAIC
  1 | lldpubmed:meshHeading      | 2,121,384 | 2,121,384 | 1.009
  2 | opiumfield:recommendation  | 1,920,992 | 1,920,992 | 1.010
  3 | fb:type.object.key         | 1,108,187 | 1,108,187 | 1.017
  4 | foaf:page                  | 1,702,704 | 1,712,970 | 1.017
  5 | skipinions:hasFeature      | 1,010,604 | 1,010,604 | 1.019


quadratically more concurrences but occur polynomially less often, until the point where the largest classes (which generally have only one occurrence) are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000), we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes of size 996); the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs and 303 million through (p,s) pairs).
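To make the threshold selection concrete, the following is a minimal sketch (in Python, which is not the implementation language of the described system) of how a class-size cut-off can be chosen from a histogram of concurrence-class sizes so that the materialised tuple count stays within a budget; the histogram values are fabricated for illustration, and the per-class tuple count follows the formula printed in the caption of Fig. 9.

    # Choose the largest concurrence-class size such that the cumulative
    # number of raw concurrence tuples stays within a given budget.
    # class_histogram maps class size s_i -> number of classes c_i.

    def tuples_per_class(size):
        # raw concurrence tuples generated by one class of the given size,
        # per the formula in the caption of Fig. 9
        return (size * size + size) // 2

    def choose_cutoff(class_histogram, budget):
        cumulative = 0
        cutoff = 0
        for size in sorted(class_histogram):
            cumulative += class_histogram[size] * tuples_per_class(size)
            if cumulative > budget:
                break
            cutoff = size
        return cutoff

    # illustrative (fabricated) histogram: {class size: number of classes}
    example_histogram = {2: 200_000_000, 3: 30_000_000, 10: 1_000_000,
                         38: 50_000, 39: 200_000, 1_000: 72_000}
    print(choose_cutoff(example_histogram, 1_000_000_000))  # -> 38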

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 2,522.9 unique objects for 3,795.3 total triples, giving an average cardinality of 1.5.

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for the predicate). For the predicate–object pairs, there was an average of 1,157.2 subjects for 2,053.2 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).
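The adjusted averages (AAC/AAIC) reported in Tables 13 and 14 are defined earlier in the paper; purely as a rough illustration of the underlying statistics, the following sketch computes the unadjusted per-predicate averages from an in-memory list of quads (the function and variable names are ours, and the actual system derives these statistics from on-disk sorted files rather than an in-memory pass).

    # Per-predicate (inverse-)cardinality statistics; illustrative only.
    from collections import defaultdict

    def predicate_stats(quads):
        # quads: iterable of (s, p, o, c) tuples
        triples = defaultdict(int)
        subjects = defaultdict(set)
        objects_ = defaultdict(set)
        for s, p, o, c in quads:
            triples[p] += 1
            subjects[p].add(s)
            objects_[p].add(o)
        stats = {}
        for p in triples:
            stats[p] = {
                # triples per distinct object value; values near 1 indicate
                # quasi-inverse-functional behaviour (cf. Table 13)
                "avg_cardinality": triples[p] / len(objects_[p]),
                # triples per distinct subject; values near 1 indicate
                # quasi-functional behaviour (cf. Table 14)
                "avg_inverse_cardinality": triples[p] / len(subjects[p]),
            }
        return stats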

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although

Table 15
Top five concurrent entities and the number of pairs they share.

  # | Entity label 1 | Entity label 2 | Concur.
  1 | New York City  | New York State | 791
  2 | London         | England        | 894
  3 | Tokyo          | Japan          | 900
  4 | Toronto        | Ontario        | 418
  5 | Philadelphia   | Pennsylvania   | 217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally that degree value.

  # | Ranked entity      | Con.  | "Closest" entity  | Val.
  1 | Tim Berners-Lee    | 908   | Lalana Kagal      | 0.83
  2 | Dan Brickley       | 2,552 | Libby Miller      | 0.94
  3 | update.status.net  | 11    | socialnetwork.ro  | 0.45
  4 | FOAF-a-matic       | 21    | foaf.me           | 0.23
  5 | Evan Prodromou     | 3,367 | Stav Prodromou    | 0.89


fewer inter-PLD concurrences are found, they typically have higher concurrence values.32

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both given as the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities; for brevity, we again show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration; we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which are not in the previous consolidation step, we output sextuples of the form

  (s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position (as appropriate), and s′ and o′ denote the input identifiers prior to consolidation.35
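For illustration only, consider a hypothetical quadruple whose subject has been rewritten to a canonical identifier (the identifiers ex:TimBL, ex:timbl-old and ex:doc1 are fictional and not from the corpus); it would be encoded as the sextuple:

  (ex:TimBL, foaf:homepage, <http://www.w3.org/People/Berners-Lee/>, ex:doc1, ex:timbl-old, <http://www.w3.org/People/Berners-Lee/>)

Here the final two elements record that the subject was originally ex:timbl-old, while the object was not rewritten; per footnote 35, the implementation abbreviates such unchanged positions with a syntactic shortcut rather than repeating the term.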

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

  p a owl:FunctionalProperty .
  x p l1 . x p l2 .
  l1 owl:differentFrom l2 .
  ⇒ false

100 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

where the first pattern (the owl:FunctionalProperty membership) is the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
  dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
  dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
  dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]
  foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
  ex:Ted foaf:age 25 .
  ex:Ted foaf:age "25" .
  ex:Ted foaf:gender "male" .
  ex:Ted foaf:gender "Male"@en .
  ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
  ex:Ted foaf:birthday "25-05" .
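As a minimal illustration of the relaxed comparison just described (a sketch only; the function names and the literal representation are ours, not the system's), the following Python snippet flags a functional-property violation for two literals only if both the typed data values and the normalised lexical forms differ:

    # Relaxed literal comparison for (datatype) functional properties:
    # two literals conflict only if their data values differ AND their
    # lower-cased lexical forms (ignoring language tag/datatype) differ.

    def data_value(literal):
        # literal is a (lexical_form, datatype, language) triple; for this
        # sketch we only canonicalise a few numeric datatypes.
        lex, dt, lang = literal
        if dt in ("xsd:double", "xsd:float", "xsd:decimal"):
            try:
                return float(lex)
            except ValueError:
                return lex
        if dt in ("xsd:int", "xsd:integer"):
            try:
                return int(lex)
            except ValueError:
                return lex
        return lex

    def conflicting_values(lit1, lit2):
        # (i) data values must be unequal (standard OWL semantics) ...
        if data_value(lit1) == data_value(lit2):
            return False
        # (ii) ... and the lower-cased lexical forms must also be unequal
        return lit1[0].lower() != lit2[0].lower()

    # ("male", None, None) vs ("Male", None, "en")            -> no conflict
    # ("3.546", "xsd:double", None) vs ("2.97", "xsd:double", None) -> conflict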

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

  sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type) and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw unconsolidated corpus.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
  wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41 http://data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns; a subset of these may require "positive" reasoning to be detected, e.g.:

Terminological [http://xmlns.com/foaf/spec/]
  foaf:Person owl:disjointWith foaf:Organization .
  foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
  identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
  identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
  semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived therefrom;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.


With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
  made owl:propertyDisjointWith maker .

Assertional [fictional]
  ex:AZ″ maker ex:entcons′ .
  dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation, by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, …, Cn} ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq^0, each containing an individual identifier in the equivalence class. Now let U(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of the entities in Ei and Ej; intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eq^n such that there is no pair ⟨a, b⟩ ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights U(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs, such that neither Ei nor Ej has already been merged in this iteration, producing Eq^(n+1) at the iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
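The following is a minimal, self-contained sketch of the greedy clustering just described. It is illustrative only: the names, the simplified weight aggregation and the toy data are ours, and the real implementation works over hash-sets of evidence and materialised concurrence scores rather than the toy numbers used here.

    # Greedy consistent clustering of one equivalence class.
    # - identifiers: members of the (inconsistent) equivalence class
    # - incompatible: set of frozensets {a, b} that must not end up together
    # - weight(A, B): aggregated edge weight between two clusters, returned
    #   as a pair (evidence_count, concurrence), compared lexicographically.
    from itertools import combinations

    def repair_equivalence_class(identifiers, incompatible, weight):
        clusters = [frozenset([i]) for i in identifiers]   # Eq^0: singletons
        changed = True
        while changed:                                     # iterate to fixpoint
            changed = False
            # candidate merges: cluster pairs with no incompatible pair across them
            candidates = []
            for a, b in combinations(clusters, 2):
                if any(frozenset([x, y]) in incompatible for x in a for y in b):
                    continue
                w = weight(a, b)
                if w > (0, 0.0):                           # only positively linked pairs
                    candidates.append((w, a, b))
            # merge in descending order of weight, skipping clusters already merged
            merged_this_round = set()
            for w, a, b in sorted(candidates, key=lambda t: t[0], reverse=True):
                if a in merged_this_round or b in merged_this_round:
                    continue
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged_this_round.update([a, b])
                changed = True
        return clusters

    # Toy usage: a and c are incompatible; evidence links a-b and b-c, with
    # the stronger pair (a, b) merged first.
    toy_weights = {frozenset(["a", "b"]): (1, 0.9), frozenset(["b", "c"]): (1, 0.2)}

    def toy_weight(A, B):
        d, c = 0, 0.0
        for x in A:
            for y in B:
                dd, cc = toy_weights.get(frozenset([x, y]), (0, 0.0))
                d, c = d + dd, max(c, cc)
        return (d, c)

    print(repair_equivalence_class(["a", "b", "c"],
                                   {frozenset(["a", "c"])}, toy_weight))
    # -> two partitions: {'a', 'b'} and {'c'}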

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respective consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s-o over the sorted stream of data, and batch consolidated segments of data.
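As a rough sketch of this macro-level step (illustrative only: the function names are ours, and the real system streams sorted on-disk files rather than Python iterables), the two sorted streams can be merged on the canonical identifier and emitted as one batch of inlinks and outlinks per consolidated entity:

    # Merge-join two streams of sextuples sorted by canonical subject and by
    # canonical object, yielding one consolidated segment per entity.
    from itertools import groupby
    import heapq

    def batches(by_subject, by_object):
        # by_subject: sextuples (s, p, o, c, s2, o2) sorted by s
        # by_object:  sextuples (o, p, s, c, o2, s2) sorted by o
        out = ((row[0], ("out", row)) for row in by_subject)
        inl = ((row[0], ("in", row)) for row in by_object)
        merged = heapq.merge(out, inl, key=lambda t: t[0])
        for key, group in groupby(merged, key=lambda t: t[0]):
            outlinks, inlinks = [], []
            for _, (direction, row) in group:
                (outlinks if direction == "out" else inlinks).append(row)
            yield key, outlinks, inlinks  # one batch per canonical identifier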

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in



memory, where for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

  Category                                  | 1 (min / %)   | 2 (min / %)  | 4 (min / %)  | 8 (min / %)  | Full-8 (min / %)
  Master
    Total execution time                    | 114.7 / 100.0 | 61.3 / 100.0 | 36.7 / 100.0 | 23.8 / 100.0 | 141.1 / 100.0
    Executing (miscellaneous)               |   0.0 / 0.0   |  0.1 / 0.1   |  0.1 / 0.1   |  0.1 / 0.1   |   0.0 / 0.0
    Idle (waiting for slaves)               | 114.7 / 100.0 | 61.2 / 100.0 | 36.6 / 99.9  | 23.8 / 99.9  | 141.1 / 100.0
  Slave (avg.)
    Executing (total excl. idle)            | 114.7 / 100.0 | 59.0 / 96.3  | 33.6 / 91.5  | 22.3 / 93.7  | 130.9 / 92.8
      Identify inconsistencies and repairs  |  74.7 / 65.1  | 38.5 / 62.8  | 23.2 / 63.4  | 17.2 / 72.2  |  72.4 / 51.3
      Repair corpus                         |  40.0 / 34.8  | 20.5 / 33.5  | 10.3 / 28.1  |  5.1 / 21.5  |  58.5 / 41.5
    Idle                                    |   0.0 / 0.0   |  2.3 / 3.7   |  3.1 / 8.5   |  1.5 / 6.3   |  10.2 / 7.2
      Waiting for peers                     |   0.0 / 0.0   |  2.2 / 3.7   |  3.1 / 8.4   |  1.5 / 6.2   |  10.1 / 7.2
      Waiting for master                    |   0.0 / 0.0   |  0.1 / 0.1   |  0.0 / 0.0   |  0.0 / 0.0   |   0.1 / 0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and thus we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent on the slave machine for it. Most notably, load balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

  Manual    | Auto      | Count | %
  SAME      | ⁄         | 223   | 44.3
  DIFFERENT | ⁄         | 261   | 51.9
  UNCLEAR   | –         | 19    | 3.8
  ⁄         | SAME      | 361   | 74.6
  ⁄         | DIFFERENT | 123   | 25.4
  SAME      | SAME      | 205   | 91.9
  SAME      | DIFFERENT | 18    | 8.1
  DIFFERENT | DIFFERENT | 105   | 40.2
  DIFFERENT | SAME      | 156   | 59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

  # | Functional property              | Detections
  1 | foaf:gender                      | 60
  2 | foaf:age                         | 32
  3 | dbo:height, dbo:length           | 4
  4 | dbo:wheelbase, dbo:width         | 3
  5 | atomowl:body                     | 1
  6 | loc:address, loc:name, loc:phone | 1

Table 19
Breakdown of inconsistency detections for disjoint classes.

  # | Disjoint class 1   | Disjoint class 2 | Detections
  1 | foaf:Document      | foaf:Person      | 163
  2 | foaf:Document      | foaf:Agent       | 153
  3 | foaf:Organization  | foaf:Person      | 17
  4 | foaf:Person        | foaf:Project     | 3
  5 | foaf:Organization  | foaf:Project     | 1

[Fig. 10: log/log plot of the number of repaired equivalence classes against the number of partitions.]

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

[Fig. 11: log/log plot of the number of partitions against the number of identifiers per partition.]

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).

44 We were particularly careful to distinguish information resources, such as WIKIPEDIA articles, and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME) and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair, to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.

Interpreting and inspecting the results, we found that the


inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but which were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our blacklist). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent, through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent; these can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks,

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain-agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions <...> there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty <thus> it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance-matching systems. It is worth noting that, in contrast to our scenario, many instance-matching systems take as input a small number of instance sets (i.e., consistently named datasets)

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

across which similarities and coreferences are computed. Thus, the instance-matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers, and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance-matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string-similarity measures and class-based machine-learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string-matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences, and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale/Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges, especially scalability and noise, of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property-combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index that uses Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

56 Again, our inspection results are available at http://swse.deri.org/entity

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Prefix        URI

"T-Box" prefixes:
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box" prefixes:
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of Data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.

[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Table 6
Largest five equivalence classes (from [32]).

#   Canonical term (lexically first in equivalence class)                               Size    Correct?
1   http://bio2rdf.org/dailymed_drugs:1000                                              8,481   –
2   http://bio2rdf.org/dailymed_drugs:1042                                              800     –
3   http://acm.rkbexplorer.com/id/person-53292-22877d02973d0d01e8f29c7113776e7e         443     ✓
4   http://agame2teach.com/ddb61cae0e083f705f65944cc3bb3968ce3f3ab59-ge_1               353     ✓
5   http://www.acm.rkbexplorer.com/id/person-236166-1b4ef5fdf4a5216256064c45a8923bc9    316     ✓


equivalence classes extracted from the full corpus, pick an identifier (possibly non-canonical) at random, and then randomly pick another coreferent identifier from its equivalence class. We then retrieve the data associated with both identifiers and manually inspect them to see if they are in fact equivalent. For each pair, we then select one of four options: SAME, DIFFERENT, UNCLEAR, TRIVIALLY SAME. We used the UNCLEAR option sparingly, and only where there was not enough information to make an informed decision or for difficult subjective choices; note that where there was not enough information in our corpus, we tried to dereference more information to minimise use of UNCLEAR. In terms of difficult subjective choices, one example pair we marked unclear was dbpedia:Folkcore and dbpedia:Experimental_folk. We used the TRIVIALLY SAME option to indicate that an equivalence is purely "syntactic", where one identifier has no information other than being the target of an owl:sameAs link; although the identifiers are still coreferent, we distinguish this case since the equivalence is not so "meaningful". For the 1,000 pairs, we chose option TRIVIALLY SAME 661 times (66.1%), SAME 301 times (30.1%), DIFFERENT 28 times (2.8%) and UNCLEAR 10 times (1%). In summary, our manual verification of the baseline results puts the precision at 97.2%, albeit with many syntactic equivalences found. Of those found to be incorrect, many were closely-related DBpedia resources that were linked to by the same resource on the Freebase exporter; an example of this was dbpedia:Rock_Hudson and dbpedia:Rock_Hudson_filmography having owl:sameAs links from the same Freebase concept fb:rock_hudson.
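As a minimal sketch of this sampling step (illustrative only: the in-memory class representation, function names and seed are assumptions, not the code used for the evaluation), such pairs could be drawn as follows:

import random

def sample_pairs(equivalence_classes, n=1000, seed=42):
    """Pick n classes at random; from each, pick two distinct coreferent
    identifiers whose associated data is then inspected manually."""
    rng = random.Random(seed)
    classes = [c for c in equivalence_classes if len(c) >= 2]
    pairs = []
    for cls in rng.sample(classes, min(n, len(classes))):
        a = rng.choice(cls)
        b = rng.choice([x for x in cls if x != a])
        pairs.append((a, b))
    return pairs

classes = [["ex:a", "ex:b", "ex:c"], ["ex:d", "ex:e"]]
print(sample_pairs(classes, n=2))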

Moving on, in Table 7 we give the most frequently co-occurring PLD-pairs in our equivalence classes, where datasets resident on these domains are "heavily" interlinked with owl:sameAs relations. We italicise the indexes of inter-domain links that were observed to often be purely syntactic.

With respect to consolidation, identifiers in 78.6 million subject positions (7% of subject positions) and 23.2 million non-rdf:type-object positions (2.6%) were rewritten, giving a total of 101.9 million positions rewritten (5.1% of total rewritable positions). The average number of documents mentioning each URI rose slightly from 4.691 to 4.719 (a 0.6% increase) due to consolidation, and the average number of PLDs also rose slightly, from 1.005 to 1.007 (a 0.2% increase).

Table 7
Top 10 PLD pairs co-occurring in the equivalence classes, with the number of equivalence classes they co-occur in.

#    PLD               PLD                       Co-occur
1    rdfize.com        uriburner.com             506,623
2    dbpedia.org       freebase.com              187,054
3    bio2rdf.org       purl.org                  185,392
4    loc.gov           info:lc/authorities (a)   166,650
5    l3s.de            rkbexplorer.com           125,842
6    bibsonomy.org     l3s.de                    125,832
7    bibsonomy.org     rkbexplorer.com           125,830
8    dbpedia.org       mpii.de                   99,827
9    freebase.com      mpii.de                   89,610
10   dbpedia.org       umbel.org                 65,565

(a) In fact, these were within the same PLD.

5. Extended reasoning consolidation

We now look at extending the baseline approach to include more expressive reasoning capabilities, in particular using OWL 2 RL/RDF rules (introduced previously in Section 2.6) to infer novel owl:sameAs relations that can then be used for consolidation alongside the explicit relations asserted by publishers.

As described in Section 2.2, OWL provides various features—including functional properties, inverse-functional properties and certain cardinality restrictions—that allow for inferring novel owl:sameAs data. Such features could ease the burden on publishers of providing explicit owl:sameAs mappings to coreferent identifiers in remote naming schemes. For example, inverse-functional properties can be used in conjunction with legacy identification schemes, such as ISBNs for books, EAN/UCC-13 or MPN for products, MAC addresses for network-enabled devices, etc., to bootstrap identity on the Web of Data within certain domains; such identification values can be encoded as simple datatype strings, thus bypassing the requirement for bespoke agreement or mappings between URIs. Also, "information resources" with indigenous URIs can be used for indirectly identifying related resources to which they have a functional mapping, where examples include personal email addresses, personal homepages, WIKIPEDIA articles, etc.

Although Linked Data literature has not explicitly endorsed or encouraged such usage, prominent grass-roots efforts publishing RDF on the Web rely (or have relied) on inverse-functional properties for maintaining consistent identity. For example, in the Friend Of A Friend (FOAF) community, a technique called smushing was proposed to leverage such properties for identity, serving as an early precursor to methods described herein.13

Versus the baseline consolidation approach, we must first apply some reasoning techniques to infer novel owl:sameAs relations. Once we have the inferred owl:sameAs data materialised and merged with the asserted owl:sameAs relations, we can then apply the same consolidation approach as per the baseline: (i) apply the transitive/symmetric closure and generate the equivalence classes; (ii) pick canonical identifiers for each class; and (iii) rewrite the corpus. However, in contrast to the baseline, we expect the extended volume of owl:sameAs data to make in-memory storage infeasible.14 Thus, in this section, we avoid the need to store equivalence information in-memory, where we instead investigate batch-processing methods for applying an on-disk closure of the equivalence relations, and likewise on-disk methods for rewriting data.

13 cf. http://wiki.foaf-project.org/w/Smushing; retr. 2011/01/22.
14 Further note that since all machines must have all owl:sameAs information, adding more machines does not increase the capacity for handling more such relations; the scalability of the baseline approach is limited by the machine with the least in-memory capacity.

Early versions of the local batch-processing techniques [31] and some of the distributed reasoning techniques [33] are borrowed from previous works. Herein, we reformulate and combine these techniques into a complete solution for applying enhanced consolidation in a distributed, scalable manner over large-scale Linked Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data; hence we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

15 http://huemer.lstadler.net/role/rh.rdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].

Thus, we look at using OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;
(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

In Step (i), we extract the terminological data required for application of our rules from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf

we would generate a T-ground rule

?x homepage ?y ⇒ ?x isPrimaryTopicOf ?y

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns such as

?x isPrimaryTopicOf ?y

which may indicate data useful for consolidation (note that here isPrimaryTopicOf is assumed to be inverse-functional).
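The following minimal sketch (in Python, with illustrative rule and triple representations that are not part of the system described here) shows how such T-ground, single-pattern rules can be applied in one scan while buffering consolidation-relevant statements:

from typing import Callable, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]
TGroundRule = Callable[[Triple], List[Triple]]  # one assertional pattern -> inferences

def sub_property_rule(p_sub: str, p_super: str) -> TGroundRule:
    """T-ground form of: p_sub rdfs:subPropertyOf p_super."""
    def rule(t: Triple) -> List[Triple]:
        s, p, o = t
        return [(s, p_super, o)] if p == p_sub else []
    return rule

def single_scan(triples: Iterable[Triple], rules: List[TGroundRule]) -> Iterator[Triple]:
    """One pass: yield each input triple and everything inferred from it."""
    for t in triples:
        todo = [t]
        while todo:
            cur = todo.pop()
            yield cur
            for rule in rules:
                todo.extend(rule(cur))

rules = [sub_property_rule("homepage", "isPrimaryTopicOf")]
consolidation_props = {"owl:sameAs", "isPrimaryTopicOf"}   # patterns worth buffering
corpus = [("ex:aidan", "homepage", "http://example.org/~aidan")]
buffered = [t for t in single_scan(corpus, rules) if t[1] in consolidation_props]
print(buffered)  # [('ex:aidan', 'isPrimaryTopicOf', 'http://example.org/~aidan')]

In practice the buffered statements would be streamed to a file rather than held in a list, in line with the on-disk orientation of the approach.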

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern (including any owl:sameAs statements found) are buffered to a file for later join computation.

Subsequently, in Step (iv), we must now compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.

18 These files are G-Zip compressed flat files of N-Triple-like syntax encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.
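As an illustration of the join step, the following sketch (an in-memory stand-in for the on-disk sort and merge-join; names and data are illustrative) derives owl:sameAs statements by grouping inverse-functional-property statements on their (property, value) join key:

from itertools import groupby
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]

def sameas_from_ifps(ifp_statements: Iterable[Triple]) -> Iterator[Triple]:
    # sort by (predicate, object): the join key for inverse-functional properties
    key = lambda t: (t[1], t[2])
    for (_p, _o), group in groupby(sorted(ifp_statements, key=key), key=key):
        subjects = sorted({s for s, _, _ in group})
        canonical = subjects[0]                 # lexically lowest as pivot
        for other in subjects[1:]:
            yield (other, "owl:sameAs", canonical)

stmts = [("ex:axel", "isPrimaryTopicOf", "http://polleres.net/"),
         ("dblp:Axel", "isPrimaryTopicOf", "http://polleres.net/")]
print(list(sameas_from_ifps(stmts)))
# [('ex:axel', 'owl:sameAs', 'dblp:Axel')]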

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method, again leveraging external sorts and merge-joins to perform the computation (herein we sketch, and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;
(ii) transitivity is supported by iterative merge-join scans:

– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored; dropped non-canonical statements are removed;
– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;

(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;
(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
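The sketch below gives a small in-memory analogue of the canonicalised closure (a union-find structure; this is not the on-disk batch procedure itself), where each equivalence class is represented by its lexically lowest term:

from typing import Dict, Iterable, Tuple

def canonical_closure(sameas: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a: str, b: str) -> None:
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        lo, hi = (ra, rb) if ra < rb else (rb, ra)
        parent[hi] = lo                     # keep the lexically lowest as root

    for a, b in sameas:
        union(a, b)
    # map every term to its canonical identifier (the class representative)
    return {x: find(x) for x in parent}

links = [("b", "a"), ("b", "c"), ("d", "e")]
print(canonical_closure(links))
# {'b': 'a', 'a': 'a', 'c': 'a', 'd': 'd', 'e': 'd'}

Note how only class-to-canonical relations are kept, which is what avoids the quadratic output that full pairwise materialisation of each equivalence class would produce.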

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2; thus, given s1 SA s2, we know that s2 is the canonical identifier and s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite, and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20

20 One could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.
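The following sketch illustrates one such rewrite pass (assuming in-memory lists in place of on-disk sorted files): a merge-join of quadruples sorted by subject against the pruned owl:sameAs relation sorted by non-canonical term:

from typing import Iterator, List, Tuple

Quad = Tuple[str, str, str, str]          # (s, p, o, context)

def rewrite_subjects(quads_by_subject: List[Quad],
                     sameas_by_noncanonical: List[Tuple[str, str]]) -> Iterator[Quad]:
    i = 0
    for s, p, o, c in quads_by_subject:
        # advance the equality pointer until it reaches (or passes) s
        while i < len(sameas_by_noncanonical) and sameas_by_noncanonical[i][0] < s:
            i += 1
        if i < len(sameas_by_noncanonical) and sameas_by_noncanonical[i][0] == s:
            s = sameas_by_noncanonical[i][1]   # replace with canonical term
        yield (s, p, o, c)

quads = sorted([("ex:b", "foaf:name", '"Axel"', "ctx1")])
eq = sorted([("ex:b", "ex:a")])            # ex:b owl:sameAs ex:a (ex:a canonical)
print(list(rewrite_subjects(quads, eq)))
# [('ex:a', 'foaf:name', '"Axel"', 'ctx1')]

Rewriting objects then amounts to sorting the output of this pass by object and running the same scan over the object position.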

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net>
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie>
<http://polleres.net> owl:sameAs <http://axel.deri.ie>

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs, we require a higher-level iterative process, as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;
(ii) apply the initial closure of the owl:sameAs data;
(iii) then, iteratively until no new owl:sameAs inferences are found:
– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
– derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;
– re-derive the closure of the owl:sameAs data, including the new inferences.

The final, closed file of owl:sameAs data can then be re-used to rewrite the main corpus, in two sorts and merge-join scans over subject and object.
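A self-contained toy version of this fixpoint, restricted to inverse-functional properties and using an in-memory canonical map in place of the on-disk closure (all names and data are illustrative), might look as follows:

from itertools import groupby

def closure(pairs):
    """Canonical map: every term -> lexically lowest member of its class."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        return x if parent[x] == x else find(parent[x])
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    return {x: find(x) for x in parent}

def ifp_sameas(stmts, canon):
    """Group IFP statements on (property, rewritten value); equate subjects."""
    key = lambda t: (t[1], canon.get(t[2], t[2]))
    out = set()
    for _, grp in groupby(sorted(stmts, key=key), key=key):
        subs = sorted({canon.get(s, s) for s, _, _ in grp})
        out.update((x, subs[0]) for x in subs[1:])
    return out

def fixpoint(asserted, ifp_stmts):
    sameas = set(asserted)
    while True:
        canon = closure(sameas)
        new = ifp_sameas(ifp_stmts, canon) - sameas
        if not new:
            return closure(sameas)
        sameas |= new

asserted = [("http://polleres.net", "http://axel.deri.ie")]
stmts = [("dblp:Axel", "isPrimaryTopicOf", "http://polleres.net"),
         ("ex:axelme", "isPrimaryTopicOf", "http://axel.deri.ie")]
print(fixpoint(asserted, stmts))   # dblp:Axel and ex:axelme end up in one class

In the second iteration no new links are found, so the loop terminates, mirroring the two-iteration behaviour reported for the evaluation corpus below.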

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;
(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;
(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;
(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;
(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;
– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);
(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;
(vii) Run: each slave machine then rewrites the subjects of their segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 25 min (18%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible: the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time; of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of the owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

Table 8
Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently.

Category                                      1            2            4            8            Full-8
                                              min    %     min    %     min    %     min    %     min    %
Master
  Total execution time                        454.3  100.0 230.2  100.0 127.5  100.0 81.3   100.0 740.4  100.0
  Executing                                   29.9   6.6   30.3   13.1  32.5   25.5  30.1   37.0  243.6  32.9
    Aggregate consolidation-relevant data     4.3    0.9   4.5    2.0   4.7    3.7   4.6    5.7   29.9   4.0
    Closing owl:sameAs *                      24.4   5.4   24.5   10.6  25.4   19.9  24.4   30.0  211.2  28.5
    Miscellaneous                             1.2    0.3   1.3    0.5   2.4    1.9   1.1    1.3   2.5    0.3
  Idle (waiting for slaves)                   424.5  93.4  199.9  86.9  95.0   74.5  51.2   63.0  496.8  67.1
Slave
  Avg. executing (total)                      448.9  98.8  221.2  96.1  111.6  87.5  56.5   69.4  599.1  80.9
    Extract terminology                       44.4   9.8   22.7   9.9   11.8   9.2   6.9    8.4   49.4   6.7
    Extract consolidation-relevant data       95.5   21.0  48.3   21.0  24.3   19.1  13.2   16.2  137.9  18.6
    Initial sort (by subject) *               98.0   21.6  48.3   21.0  23.7   18.6  11.6   14.2  133.8  18.1
    Consolidation                             211.0  46.4  101.9  44.3  51.9   40.7  24.9   30.6  278.0  37.5
  Avg. idle                                   5.5    1.2   8.9    3.9   37.9   29.7  23.6   29.0  141.3  19.1
    Waiting for peers                         0.0    0.0   3.2    1.4   7.1    5.6   6.3    7.7   37.5   5.1
    Waiting for master                        5.5    1.2   5.8    2.5   30.8   24.1  17.3   21.2  103.8  14.0

With regard to the performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective ratios of total task time as the number of slave machines doubles (moving from 1 slave machine up to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3), motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements (this time directly extracted from explicit statements, as opposed to explicit and inferred statements), which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); and (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

Table 9
Top ten most frequently occurring blacklisted values.

#    Blacklisted term                                        Occurrences
1    Empty literals                                          584,735
2    <http://null>                                           414,088
3    <http://www.vox.com/gone>                               150,402
4    "08445a31a78661b5c746feff39a9db6e4e2cc5cf"              58,338
5    <http://www.facebook.com>                               6,988
6    <http://facebook.com>                                   5,462
7    <http://www.google.com>                                 2,234
8    <http://www.facebook.com/>                              1,254
9    <http://google.com>                                     1,108
10   <http://null.com>                                       542

[Fig. 5 (log/log plot): number of classes vs. equivalence class size, for baseline consolidation and reasoning consolidation.]
Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log).

Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties, and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments for older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters, and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9—this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string "mailto:", commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
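The blacklist check itself reduces to a simple filter over the consolidation-relevant statements. The following minimal sketch (ours, not the paper's implementation; the property name and blacklist entries are a small illustrative subset) shows the shape of such a filter:

# Sketch: drop consolidation-relevant statements whose identifying value
# appears on a manually curated blacklist of over-shared, erroneous values.
BLACKLIST = {
    "",                                          # empty literals
    "http://null",
    "08445a31a78661b5c746feff39a9db6e4e2cc5cf",  # SHA-1 of the string "mailto:"
}

def is_blacklisted(value: str) -> bool:
    return value in BLACKLIST

def filter_ifp_statements(statements, inverse_functional_properties):
    """Yield (s, p, o) memberships of inverse-functional properties whose
    object value is not blacklisted; filtering in the functional-property
    direction is analogous."""
    for s, p, o in statements:
        if p in inverse_functional_properties and is_blacklisted(o):
            continue
        yield (s, p, o)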

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22 cf. http://bio2rdf.org/ns/bio2rdf:Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that each statement s1 owl:sameAs s2 has s1 > s2, with s2 a canonical identifier) contained 12.03 million statements.
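To illustrate the shape of such an index, the following in-memory sketch (our illustration, not the distributed on-disk method used for the corpus) derives equivalence classes from raw owl:sameAs pairs and maps each non-canonical term to the lexically lowest identifier of its class:

from collections import defaultdict

def canonical_same_as_index(same_as_pairs):
    """Return a dict mapping each non-canonical term to the lexically lowest
    member of its equivalence class (a non-symmetric owl:sameAs index)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)   # keep lexically lowest as root

    for s1, s2 in same_as_pairs:
        union(s1, s2)

    members = defaultdict(set)
    for term in parent:
        members[find(term)].add(term)

    index = {}
    for canon, terms in members.items():
        for t in terms:
            if t != canon:
                index[t] = canon            # non-symmetric: term -> canonical
    return index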

From this data, we generated 2.82 million equivalence classes (an increase of 1.31x from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58x from baseline—5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73x from baseline—5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014x from baseline—6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98x baseline), the largest equivalence class size grow to 33,052 (3.9x baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10 we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user—labelled Team Vox—commonly appearing in user FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford—labelled IBRG—where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME for 823 pairs (82.3%), TRIVIALLY SAME for 145 (14.5%), DIFFERENT for 23 (2.3%) and UNCLEAR for 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

# | Canonical term (lexically lowest in equivalence class)            | Size   | Correct
1 | _:bnode37 (http://a12iggymom.vox.com/profile/foaf.rdf)            | 33,052 | ✓
2 | http://bio2rdf.org/dailymed_drugs:1000                            | 8,481  | –
3 | http://ajft.org/rdf/foaf.rdf#_me                                  | 8,140  | ✓
4 | _:bnode4 (http://174.129.12.140:8080/tcm/data/association/100)    | 4,395  | ✓
5 | _:bnode1 (http://aaa977.vox.com/profile/foaf.rdf)                 | 1,977  | ✓

[Figure: log/log plot; x-axis: number of PLDs in equivalence class; y-axis: number of classes; series: baseline consolidation, reasoning consolidation.]

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log).

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

# | Canonical identifier                          | BL | R
1 | <http://dblp.l3s.de/.../Tim_Berners-Lee>      | 26 | 50
2 | <genid:danbri>                                | 10 | 138
3 | <http://update.status.net/>                   | 0  | 0
4 | <http://www.ldodds.com/foaf/foaf-a-matic>     | 0  | 0
5 | <http://update.status.net/user/1#acct>        | 0  | 6

[Figure: log/log plot; x-axis: number of PLDs mentioning a blank-node/URI term; y-axis: number of terms; series: raw data, baseline consolidated, reasoning consolidated.]

Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated and reasoning consolidated data (log/log).


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers

25 We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley"—one of the founders and leading contributors of the FOAF project—with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11 we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8x the baseline consolidation approach). In Fig. 7 we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs—one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.



(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity—in terms of unique PLDs—was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014x) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities—with higher concurrence scores—are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (5.6 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results: we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities—if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs—pairs that are typically shared by few entities; for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities—a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs—e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers)—where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important: the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p, s), is the cardinality of the set $\{o \in \mathbf{C} \mid (s,p,o) \in G\}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p, o), is the cardinality of the set $\{s \in \mathbf{U} \cup \mathbf{B} \mid (s,p,o) \in G\}$.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting Gex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph—e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores—note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality)—we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathrm{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is, by definition, greater-than or equal-to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or as the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 quasi-functional properties. We see the values of such properties—in their respective directions—as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.
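To make the definitions concrete, the following sketch (our illustration) computes observed cardinalities, inverse-cardinalities and the average inverse-cardinality over the running example graph:

from collections import defaultdict

# The running example graph as (s, p, o) triples, prefixes abbreviated.
G = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
]

card, icard = defaultdict(set), defaultdict(set)
for s, p, o in G:
    card[(p, s)].add(o)      # Card_G(p, s): distinct objects per (p, s)
    icard[(p, o)].add(s)     # ICard_G(p, o): distinct subjects per (p, o)

def aic(p):
    """Average inverse-cardinality: triples with predicate p / distinct objects of p."""
    triples = sum(len(subs) for (q, _), subs in icard.items() if q == p)
    objects = sum(1 for (q, _) in icard if q == p)
    return triples / objects

print(len(card[("foaf:maker", "dblp:AliceB10")]))   # 2
print(len(icard[("foaf:gender", '"female"')]))      # 2
print(aic("foaf:gender"))                           # 1.5 (3 triples / 2 values)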

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean—consider a fictional property ex:maritalStatus, for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naively over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula as follows to weight properties with few observations towards a global mean:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

$$\mathrm{AAIC}_G(p) = \frac{\mathrm{AIC}_G(p)\cdot|G_p| \;+\; \overline{\mathrm{AIC}_G}\cdot\overline{|G|}}{|G_p| \;+\; \overline{|G|}} \qquad (1)$$

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathrm{AIC}_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{\mathrm{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s\,\exists o : (s,p,o) \in G\}|}$).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that $\mathrm{AIC}_G(p)\cdot|G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathrm{AIC}_G}\cdot\overline{|G|} = \frac{|G|}{|\{p \mid \exists s\,\exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate. We maintain Eq.

[Figure: example graph relating swperson:stefan-decker and swperson:andreas-harth (highlighted) to shared papers (swpaper:140, swpaper:2221, swpaper:3403), affiliations (sworg:deri-nui-galway, sworg:nui-galway via foaf:member and swrc:affiliation) and dbpedia:Ireland (via foaf:based_near), with parallel foaf:made, dc:creator, swrc:author and foaf:maker edges to the papers.]

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).


(1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean ($\overline{\mathrm{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size ($\overline{|G|}$).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company—e.g., http://google.com/—should not gain as much concurrence as entities that share a value referential to a smaller company—e.g., http://deri.ie/. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country and for which we find many observations in our corpus, returning a high AAIC value, and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28
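The credibility adjustment of Eq. (1) can be sketched as a small helper; the figures in the example calls below are purely illustrative, not statistics from our corpus:

def aaic(aic_p, n_objects_p, global_mean_aic, avg_objects_per_predicate):
    """Adjusted average inverse-cardinality (Eq. (1)): a per-predicate reading
    aic_p, derived from n_objects_p distinct objects, is weighted against the
    global mean global_mean_aic, whose weight is the average sample size
    avg_objects_per_predicate.  Sketch only; AAC is analogous."""
    return ((aic_p * n_objects_p + global_mean_aic * avg_objects_per_predicate)
            / (n_objects_p + avg_objects_per_predicate))

# A sparsely observed predicate stays close to the global mean, whereas a
# well-observed predicate (roughly) keeps its own reading:
print(aaic(1.0, 2, 2.5, 10_000))          # ~2.50
print(aaic(1.0, 2_000_000, 2.5, 10_000))  # ~1.01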

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p, s) with respect to a graph G is given as:

$$C_G(p,s) = \frac{1}{\mathrm{Card}_G(p,s)\cdot \mathrm{AAC}_G(p)}$$

and the concurrence-coefficient of a predicate–object pair (p, o) with respect to a graph G is analogously given as:

$$IC_G(p,o) = \frac{1}{\mathrm{ICard}_G(p,o)\cdot \mathrm{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0, 1], since the denominator is, by definition, necessarily greater-than or equal-to one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations,

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts: for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com/ and assume that o_g occurs 2,000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2000; now IC_G(p_wh, o_g) = 1/(2000 x 7) ≈ 0.00007. Also let o_d = http://deri.ie/ such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 x 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that z_i ∈ ]0, 1] for each i = 1, ..., n. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1, ..., n, ACS_k = z_k + ACS_{k-1} - z_k·ACS_{k-1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring—P(A ∨ B) = P(A) + P(B) - P(A)·P(B)—which is by definition commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
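A direct transcription of Definition 6 (our sketch) folds the coefficients with the pairwise probabilistic sum:

from functools import reduce

def acs(coefficients):
    """Aggregated concurrence score (Definition 6): fold coefficients (each in
    ]0,1]) with the probabilistic sum a + z - a*z; the result is independent of
    the order of the elements, since the operation is commutative and associative."""
    return reduce(lambda a, z: a + z - a * z, coefficients, 0.0)

print(acs([0.00143]))                    # one shared, fairly exclusive pair
print(acs([0.00143, 0.00143, 0.00007]))  # more shared pairs -> higher score, still < 1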

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate


edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate $p_i$.29 We then aggregate this bag as follows:

$$\mathrm{ACS}(\mathcal{Z}) = \mathrm{ACS}\Big(\big(\mathrm{ACS}(Z_{p_i}\cdot \mathrm{AA[I]C}(p_i))\big)_{Z_{p_i}\in\mathcal{Z}}\Big)$$

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory—if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s, p) pair and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p, o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (4).
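Assuming the data are already sorted by (s, p, o, c), a single pass of the following shape suffices for the cardinality direction; this is a simplified, in-memory sketch of the scan described above, and the inverse direction is analogous over (o, p, s, c) order:

from collections import defaultdict
from itertools import groupby

def predicate_average_cardinalities(sorted_quads):
    """Scan quads sorted by (s, p, o, c) and compute, per predicate, the
    average number of distinct objects per (s, p) pair."""
    totals = defaultdict(lambda: [0, 0])      # p -> [sum of cardinalities, #(s,p) pairs]
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        distinct_objects = {o for _, _, o, _ in group}
        totals[p][0] += len(distinct_objects)
        totals[p][1] += 1
    return {p: total / pairs for p, (total, pairs) in totals.items()}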

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s, p) pair, for each set of unique objects O_ps found thereon, and for each pair in

$$\{(o_i, o_j) \in (\mathbf{U}\cup\mathbf{B})\times(\mathbf{U}\cup\mathbf{B}) \mid o_i, o_j \in O_{ps},\ o_i < o_j\}$$

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p, s), p, s, -)

where $C(p,s) = \frac{1}{|O_{ps}|\cdot \mathrm{AAC}(p)}$. We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects S_po, and for each pair in

$$\{(s_i, s_j) \in (\mathbf{U}\cup\mathbf{B})\times(\mathbf{U}\cup\mathbf{B}) \mid s_i, s_j \in S_{po},\ s_i < s_j\}$$

we output analogous sextuples of the form:

(s_i, s_j, IC(p, o), p, o, +)

We call the sets O_ps and their analogues S_po concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '-' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores.

29 For brevity, we omit the graph subscript.

Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements—generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
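Assuming the sextuple files are sorted by their first two elements, the final aggregation can be sketched as a grouped scan; this is an in-memory simplification of the on-disk merge-join, and it ignores the same-value pruning and per-predicate dampening described earlier:

from itertools import groupby

def aggregate_concurrence(sorted_sextuples):
    """Scan sextuples (e_i, e_j, coefficient, predicate, term, direction)
    sorted by (e_i, e_j) and emit (e_i, e_j, score) tuples, aggregating the
    coefficients with the probabilistic sum."""
    for (ei, ej), group in groupby(sorted_sextuples, key=lambda t: (t[0], t[1])):
        score = 0.0
        for _, _, coeff, _, _, _ in group:
            score = score + coeff - score * coeff
        yield (ei, ej, score)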

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets—for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate $\frac{10^6\cdot(10^6-1)}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set—e.g., |S_po|—is a factor in the denominator of the computed C(p, o) score, such that $C(p,o) \le \frac{1}{|S_{po}|}$. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38—we will see more in Section 6.4.
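Given the class-size distribution gathered alongside the predicate statistics, the cut-off can be chosen with a simple cumulative scan; a sketch, under the assumption that the distribution fits in memory:

def max_class_size(class_size_histogram, tuple_budget):
    """class_size_histogram maps a concurrence-class size s to the number of
    classes of that size; return the largest cut-off such that the number of
    non-reflexive, non-symmetric tuples, s*(s-1)/2 per class, stays within
    tuple_budget overall."""
    total, cutoff = 0, 1
    for size in sorted(class_size_histogram):
        total += class_size_histogram[size] * size * (size - 1) // 2
        if total > tuple_budget:
            break
        cutoff = size
    return cutoff

# e.g., a toy histogram where the budget forces a small cut-off
print(max_class_size({2: 10, 3: 5, 40: 1}, tuple_budget=50))   # 3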

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash-function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object—triples with rdf:type as predicate are not included in the object-hashing; subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12
Breakdown of timing of distributed concurrence analysis (columns: 1, 2, 4 and 8 slave machines, and Full-8; each cell gives minutes / % of total execution time).

Category | 1 | 2 | 4 | 8 | Full-8
Master
  Total execution time | 516.2 / 100.0 | 262.7 / 100.0 | 137.8 / 100.0 | 73.0 / 100.0 | 835.4 / 100.0
  Executing | 2.2 / 0.4 | 2.3 / 0.9 | 3.1 / 2.3 | 3.4 / 4.6 | 2.1 / 0.3
    Miscellaneous | 2.2 / 0.4 | 2.3 / 0.9 | 3.1 / 2.3 | 3.4 / 4.6 | 2.1 / 0.3
  Idle (waiting for slaves) | 514.0 / 99.6 | 260.4 / 99.1 | 134.6 / 97.7 | 69.6 / 95.4 | 833.3 / 99.7
Slave
  Avg. executing (total exc. idle) | 514.0 / 99.6 | 257.8 / 98.1 | 131.1 / 95.1 | 66.4 / 91.0 | 786.6 / 94.2
    Split/sort/scatter (subject) | 119.6 / 23.2 | 59.3 / 22.6 | 29.9 / 21.7 | 14.9 / 20.4 | 175.9 / 21.1
    Merge-sort (subject) | 42.1 / 8.2 | 20.7 / 7.9 | 11.2 / 8.1 | 5.7 / 7.9 | 62.4 / 7.5
    Split/sort/scatter (object) | 115.4 / 22.4 | 56.5 / 21.5 | 28.8 / 20.9 | 14.3 / 19.6 | 164.0 / 19.6
    Merge-sort (object) | 37.2 / 7.2 | 18.7 / 7.1 | 9.6 / 7.0 | 5.1 / 7.0 | 55.6 / 6.6
    Extract high-level statistics | 20.8 / 4.0 | 10.1 / 3.8 | 5.1 / 3.7 | 2.7 / 3.7 | 28.4 / 3.3
    Generate raw concurrence tuples | 45.4 / 8.8 | 23.3 / 8.9 | 11.6 / 8.4 | 5.9 / 8.0 | 67.7 / 8.1
    Coordinate/sort concurrence tuples | 97.6 / 18.9 | 50.1 / 19.1 | 25.0 / 18.1 | 12.5 / 17.2 | 166.0 / 19.9
    Merge-sort/aggregate similarity | 36.0 / 7.0 | 19.2 / 7.3 | 10.0 / 7.2 | 5.3 / 7.2 | 66.6 / 8.0
  Avg. idle | 2.2 / 0.4 | 4.9 / 1.9 | 6.7 / 4.9 | 6.5 / 9.0 | 48.8 / 5.8
    Waiting for peers | 0.0 / 0.0 | 2.6 / 1.0 | 3.6 / 2.6 | 3.2 / 4.3 | 46.7 / 5.6
    Waiting for master | 2.2 / 0.4 | 2.3 / 0.9 | 3.1 / 2.3 | 3.4 / 4.6 | 2.1 / 0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step—in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to an uneven distribution of data, and subsequently uneven load balancing—e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970—even for classes with two members, the potential effect on concurrence is negligible.
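The split function used by the coordinate steps can be sketched as a simple modulo hash over the join position, with the rdf:type exception applied when hashing on objects (illustrative only; the actual implementation streams sorted batches between machines):

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def slave_for(quad, position, num_slaves):
    """Assign a quad to a slave machine by hashing the term at the given
    position ('subject' or 'object'); rdf:type triples are not redistributed
    by object, to avoid skew from huge classes such as foaf:Person."""
    s, p, o, c = quad
    if position == "object" and p == RDF_TYPE:
        return None                      # handled separately, not object-hashed
    term = s if position == "subject" else o
    return hash(term) % num_slaves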

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 min is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (96%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks—again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (one machine) to 9% (eight machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static—i.e., larger class sizes give

[Figure: two log/log plots, (a) predicate–object pairs and (b) predicate–subject pairs; x-axis: concurrence class size; y-axis: count; series: number of classes, concurrence output size, cumulative concurrence output size, cut-off.]

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs, respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i·(s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

# | Predicate                 | Objects     | Triples     | AAC
1 | foaf:nick                 | 150,433,035 | 150,437,864 | 1.000
2 | lldpubmed:journal         | 6,790,426   | 6,795,285   | 1.003
3 | rdf:value                 | 2,802,998   | 2,803,021   | 1.005
4 | eurostat:geo              | 2,642,034   | 2,642,034   | 1.005
5 | eurostat:time             | 2,641,532   | 2,641,532   | 1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

# | Predicate                   | Subjects  | Triples   | AAIC
1 | lldpubmed:meshHeading       | 2,121,384 | 2,121,384 | 1.009
2 | opiumfield:recommendation   | 1,920,992 | 1,920,992 | 1.010
3 | fb:type.object.key          | 1,108,187 | 1,108,187 | 1.017
4 | foaf:page                   | 1,702,704 | 1,712,970 | 1.017
5 | skipinions:hasFeature       | 1,010,604 | 1,010,604 | 1.019


quadratically more concurrences, but occur polynomially less often—until the point where the largest classes, which generally only have one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise" where, for milestone concurrence class sizes (especially at 50, 100, 1000, 2000), we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996)—the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999)—almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13.

30 cf. http://api.hi5.com/rest/profile/foaf/100614697.
31 cf. http://bio2rdf.org/mesh:D000123Q000235.

These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs—although

Table 15
Top five concurrent entities and the number of pairs they share.

# | Entity label 1  | Entity label 2  | Concur.
1 | New York City   | New York State  | 791
2 | London          | England         | 894
3 | Tokyo           | Japan           | 900
4 | Toronto         | Ontario         | 418
5 | Philadelphia    | Pennsylvania    | 217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

# | Ranked entity      | Con.  | "Closest" entity  | Val.
1 | Tim Berners-Lee    | 908   | Lalana Kagal      | 0.83
2 | Dan Brickley       | 2,552 | Libby Miller      | 0.94
3 | update.status.net  | 11    | socialnetwork.ro  | 0.45
4 | FOAF-a-matic       | 21    | foaf.me           | 0.23
5 | Evan Prodromou     | 3,367 | Stav Prodromou    | 0.89


fewer intra-PLD concurrences are found, they typically have higher concurrences.32

In Table 15 we give the labels of the top five most concurrent entities, including the number of pairs they share—the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters—properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16 we give a description of the concurrent entities found for the top-five ranked entities—for brevity, again we show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that—even by only exploiting the formal logical consequences of the data through reasoning—consolidation may already be imprecise, due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading: for the purposes of illustration, we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class—breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes, which incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are and are not consolidated, in the previous consolidation step we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position (as appropriate), and s′ and o′ denote the input identifiers prior to consolidation.35

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length—a functional property—and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:


Terminological [httpxmlnscomfoafspecindexrdf]

9 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e f rn=aacute=ndezSergiohtml0 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e a nzuolaSergio_Fern=aacute=ndezhtml1

foafPerson owldisjointWith foafDocument

Assertional [fictional]

exTed foafage 25

exTed foafage lsquolsquo25rsquorsquoexTed foafgender lsquolsquomalersquorsquoexTed foafgender lsquolsquoMalersquorsquoenexTed foafbirthday lsquolsquo25-05rsquorsquo^^xsdgMonthDayexTed foafbirthday lsquolsquo25-05rsquorsquo
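As an aside, a minimal sketch of this relaxed comparison follows; it assumes (purely for illustration) that literals are represented as (lexical form, language tag, datatype) triples and that a helper data_value returns the datatype-interpreted value of a literal:

```python
def lexical_value(lit):
    """Strip the datatype and language-tag and lower-case the remaining lexical form."""
    lexical, _lang, _dtype = lit
    return lexical.strip().lower()

def conflicting_values(l1, l2, data_value):
    """Flag an inconsistency for a functional datatype-property only if
    (i) the typed data values differ (standard OWL semantics), and
    (ii) the case-normalised lexical forms also differ (our relaxation)."""
    if data_value(l1) == data_value(l2):
        return False  # equal values: no conflict under OWL semantics
    return lexical_value(l1) != lexical_value(l2)
```

Under such a check, "male" and "Male"@en above would not be flagged, whereas "3.546"^^xsd:double and "2.97"^^xsd:double from the earlier example would.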

With respect to these consistency checking rules, we consider the terminological data to be sound (in any case, we always source terminological data from the raw, unconsolidated corpus). We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
  wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández" and "Sergio Fernández Anzuola", who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export (http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf). The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
  foaf:Person owl:disjointWith foaf:Organization .
  foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
  identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
  identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
  semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw. (Note that this could also be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.)
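To make the detection step concrete, the following minimal sketch (illustrative names only, not the system's actual code) checks the class memberships gathered for a consolidated entity against authoritative owl:disjointWith axioms, in the spirit of rule cax-dw:

```python
def disjoint_class_violations(types, disjoint_pairs):
    """Return pairs of disjoint classes that a single (consolidated) entity
    is a member of; any such pair signals an inconsistency (cf. rule cax-dw)."""
    return [(c1, c2) for c1, c2 in disjoint_pairs if c1 in types and c2 in types]

# e.g., types asserted/inferred for the consolidated W3C entity:
types = {"foaf:Person", "foaf:Organization"}
disjoint_pairs = [("foaf:Person", "foaf:Organization"), ("foaf:Person", "foaf:Document")]
print(disjoint_class_violations(types, disjoint_pairs))
# [('foaf:Person', 'foaf:Organization')]
```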

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
  made owl:propertyDisjointWith maker .

Assertional [fictional]:
  ex:AZ″ maker ex:entcons′ .
  dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both: we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject (loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics); c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq^0, each containing an individual identifier in the equivalence class. Now, let U(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej (intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph). We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eq^n such that there exists no {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of U(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eq^(n+1) at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers. (We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].)
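The following condensed sketch illustrates this greedy clustering under the stated assumptions. The data structures are illustrative: evidence maps an unordered identifier pair to the set of supporting input-triple hashes, and concurrence maps a pair to its concurrence score; here concurrence is computed eagerly for brevity, whereas our implementation only materialises it on ties.

```python
from itertools import combinations

def aggregated_weight(ci, cj, evidence, concurrence):
    """U(Ci, Cj): unique direct evidences between the two clusters, plus a
    concurrence score used to break ties (approximated here as a pairwise sum)."""
    d = len({e for a in ci for b in cj for e in evidence.get(frozenset((a, b)), set())})
    c = sum(concurrence.get(frozenset((a, b)), 0.0) for a in ci for b in cj)
    return (d, c)

def repair(eq, incompatible, evidence, concurrence):
    """Greedily rebuild consistent partitions of the equivalence class `eq`,
    never merging two clusters that would bring an incompatible pair together."""
    clusters = [frozenset([i]) for i in eq]            # start from singleton seeds
    changed = True
    while changed:                                     # iterate until fixpoint
        changed = False
        candidates = []
        for ci, cj in combinations(clusters, 2):
            if any(frozenset((a, b)) in incompatible for a in ci for b in cj):
                continue                               # not consistently mergeable
            w = aggregated_weight(ci, cj, evidence, concurrence)
            if w[0] > 0:                               # only merge where direct evidence exists
                candidates.append((w, ci, cj))
        merged = set()
        for w, ci, cj in sorted(candidates, key=lambda x: x[0], reverse=True):
            if ci in merged or cj in merged:
                continue                               # each cluster merges at most once per round
            clusters.remove(ci)
            clusters.remove(cj)
            clusters.append(ci | cj)
            merged.update({ci, cj})
            changed = True
    return clusters
```

Sorting the candidate pairs by their (d, c) tuples realises the total lexicographical order over edge weights described above.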

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s-o over the sorted stream of data and batch consolidated segments of data.
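As an illustration of this merge-join, the following sketch batches the inlinks and outlinks gathered under each canonical identifier; it assumes (for simplicity) that by_subject and by_object are streams of plain sextuples (s, p, o, c, s0, o0), pre-sorted by canonical subject and canonical object respectively:

```python
import heapq
from itertools import groupby

def consolidated_batches(by_subject, by_object):
    """Merge-join two sorted sextuple streams and yield one batch of
    outlinks/inlinks per canonical identifier."""
    subj_stream = ((sx[0], ("out", sx)) for sx in by_subject)   # keyed by canonical subject
    obj_stream = ((sx[2], ("in", sx)) for sx in by_object)      # keyed by canonical object
    merged = heapq.merge(subj_stream, obj_stream, key=lambda kv: kv[0])
    for canonical_id, group in groupby(merged, key=lambda kv: kv[0]):
        yield canonical_id, [entry for _, entry in group]
```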

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17. Breakdown of timing of distributed disambiguation and repair (for 1, 2, 4 and 8 slave machines, and for the full corpus on 8 machines), giving, per setup, the master's total execution, executing and idle (waiting for slaves) times, and the slaves' average times spent identifying inconsistencies and repairs, repairing the corpus, and idle (waiting for peers/waiting for master).

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair; thus, we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage (and the associated load-balancing issues) would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency.


Table 20. Results of manual repair inspection.

Manual | Auto | Count | %
SAME | / | 223 | 44.3
DIFFERENT | / | 261 | 51.9
UNCLEAR | – | 19 | 3.8
/ | SAME | 361 | 74.6
/ | DIFFERENT | 123 | 25.4
SAME | SAME | 205 | 91.9
SAME | DIFFERENT | 18 | 8.1
DIFFERENT | DIFFERENT | 105 | 40.2
DIFFERENT | SAME | 156 | 59.7

Table 18. Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

# | Functional property | Detections
1 | foaf:gender | 60
2 | foaf:age | 32
3 | dbo:height, dbo:length | 4
4 | dbo:wheelbase, dbo:width | 3
5 | atomowl:body | 1
6 | loc:address, loc:name, loc:phone | 1

Table 19. Breakdown of inconsistency detections for disjoint-classes.

# | Disjoint class 1 | Disjoint class 2 | Detections
1 | foaf:Document | foaf:Person | 163
2 | foaf:Document | foaf:Agent | 153
3 | foaf:Organization | foaf:Person | 17
4 | foaf:Person | foaf:Project | 3
5 | foaf:Organization | foaf:Project | 1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.


Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times (we were particularly careful to distinguish information resources, such as WIKIPEDIA articles, and non-information resources as being DIFFERENT). We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
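These figures can be reconstructed directly from the counts in Table 20:

```latex
\begin{align*}
\mathrm{precision}_{\mathrm{same}} &= 205/361 \approx 0.568, \quad
\mathrm{recall}_{\mathrm{same}} = 205/223 \approx 0.919, \quad
F_1 \approx 0.702,\\
\mathrm{precision}_{\mathrm{diff}} &= 105/123 \approx 0.854, \quad
\mathrm{recall}_{\mathrm{diff}} = 105/261 \approx 0.402, \quad
F_1 \approx 0.546.
\end{align*}
```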

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner (of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29]). Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log n); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines is small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues (we reduce terminological statements on a document-by-document basis according to unaligned blank-node positions: for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom);

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen, for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets) we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources (although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain). Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty... [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative (http://oaei.ontologymatching.org/) introduced a new test track on instance matching (http://www.instancematching.org/). We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38] (in fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d); Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.


9.4. Large-scale/Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities (from personal communication with the Sindice team). The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data, and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface (e.g., http://sig.ma/search?q=paris).

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space, where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22] (http://www.rkbexplorer.com/sameAs/), <sameAs> (http://sameas.org/) and ObjectCoref [15] (http://ws.nju.edu.cn/objectcoref/) offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging (again, our inspection results are available at http://swse.deri.org/entity).

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper (particularly as applied to large-scale Linked Data corpora) is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1. Prefixes used throughout the paper.

"T-Box prefixes"
atomowl: http://bblfish.net/work/atom-owl/2006-06-06/
b2r: http://bio2rdf.org/
b2rr: http://bio2rdf.org/bio2rdf_resource:
dbo: http://dbpedia.org/ontology/
dbp: http://dbpedia.org/property/
ecs: http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat: http://ontologycentral.com/2009/01/eurostat/ns#
fb: http://rdf.freebase.com/ns/
foaf: http://xmlns.com/foaf/0.1/
geonames: http://www.geonames.org/ontology#
kwa: http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed: http://linkedlifedata.com/resource/pubmed/
loc: http://sw.deri.org/2006/07/location/loc#
mo: http://purl.org/ontology/mo/
mvcb: http://webns.net/mvcb/
opiumfield: http://rdf.opiumfield.com/lastfm/spec/
owl: http://www.w3.org/2002/07/owl#
quaffing: http://purl.org/net/schemas/quaffing/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
skipinions: http://skipforward.net/skipforward/page/seeder/skipinions/
skos: http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl: http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia: http://dbpedia.org/resource/
dblpperson: http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p: http://www.eswc2006.org/people/
identicau: http://identi.ca/user/
kingdoms: http://lod.geospecies.org/kingdoms/
macs: http://stitch.cs.vu.nl/alignments/macs/
semweborg: http://www.data.semanticweb.org/organization/
swid: http://semanticweb.org/id/
timblfoaf: http://www.w3.org/People/Berners-Lee/card#
vperson: http://virtuoso.openlinksw.com/dataspace/person/
wikier: http://www.wikier.org/foaf.rdf#
yagor: http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission, 2008. http://www.w3.org/TeamSubmission/turtle/.

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial, 2008. Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OKKAM: towards a solution to the "identity crisis" on the Semantic Web, in:


Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings, 2006.

[12] Silvana Castano Alfio Ferrara Stefano Montanelli Davide Lorusso InstanceMatching for Ontology Population in Salvatore Gaglio Ignazio InfantinoDomenico Saccaacute (Eds) SEBD 2008 pp 121ndash132

[13] Zhaoqi Chen Dmitri V Kalashnikov Sharad Mehrotra Exploiting relationshipsfor object consolidation in IQIS rsquo05 Proceedings of the Second internationalworkshop on Information quality in information systems ACM Press NewYork NY USA 2005 pp 47ndash58

[14] Gong Cheng Weiyi Ge Honghan Wu Yuzhong Qu Searching Semantic Webobjects based on class hierarchies in Proceedings of Linked Data on the WebWorkshop (2008)

[15] Gong Cheng Yuzhong Qu Searching linked objects with falcons approachimplementation and evaluation Int J Sem Web Inf Syst 5 (3) (2009) 49ndash70

[16] Martine De Cock Etienne E Kerre On (un)suitable fuzzy relations to modelapproximate equality Fuzzy Sets Syst 133 (2) (2003) 137ndash153

[17] Philippe Cudreacute-Mauroux Parisa Haghani Michael Jost Karl Aberer Hermannde Meer idMesh graph-based disambiguation of linked data WWW (2009)591ndash600

[18] Li Ding Joshua Shinavier Zhenning Shangguan Deborah L McGuinnessSameAs networks and beyond analyzing deployment status and implicationsof owl sameAs in linked data in Proceedings of the Ninth InternationalSemantic Web Conference on the Semantic Web ndash Volume Part I ISWCrsquo10Springer-Verlag Berlin Heidelberg 2010 pp 145ndash160

[19] AK Elmagarmid PG Ipeirotis VS Verykios Duplicate record detection asurvey IEEE Trans Knowledge Data Eng 19 (1) (2007) 1ndash16

[20] Jeacuterocircme Euzenat Alfio Ferrara Christian Meilicke Andriy Nikolov Juan PaneFranccedilois Scharffe Pavel Shvaiko Heiner Stuckenschmidt Ondrej Svaacuteb-Zamazal Vojtech Svaacutetek Caacutessia Trojahn Results of the ontology alignmentevaluation initiative 2010 OM (2010)

[21] Wenfei Fan Xibei Jia Jianzhong Li Shuai Ma Reasoning about record matchingrules PVLDB 2 (1) (2009) 407ndash418

[22] Hugh Glaser Afraz Jaffri Ian Millard Managing co-reference on the SemanticWeb In Proc of LDOW 2009 (2009)

[23] Hugh Glaser Ian Millard Afraz Jaffri RKBExplorercom A knowledge driveninfrastructure for linked data providers in ESWC Demo Lecture Notes inComputer Science Springer 2008 pp 797ndash801

[24] Bernardo Cuenca Grau Boris Motik Zhe Wu Achille Fokoue Carsten Lutz(Eds) OWL 2 Web Ontology Language Profiles W3C Working Draft (2008)httpwwww3orgTRowl2-profiles

[25] Laura M Haas Martin Hentschel Donald Kossmann Reneacutee J Miller Schemaand data a holistic approach to mapping resolution and fusion in informationintegration ER (2009) 27ndash40

[26] Harry Halpin Patrick J Hayes James P McCusker Deborah L McGuinnessHenry S Thompson When owlsameAs Isnrsquot the Same An Analysis of Identityin Linked Data in International Semantic Web Conference (1) (2010) 305ndash320

[27] Harry Halpin Ivan Herman Pat Hayes When owlsameAs isnrsquot the Same AnAnalysis of Identity Links on the Semantic Web in Linked Data on the WebWWW2010 Workshop (LDOW2010) (2010)

[28] Patrick Hayes RDF Semantics W3C Recommendation (2004) httpwwww3orgTRrdf-mt

[29] Aidan Hogan Exploiting RDFS and OWL for Integrating Heterogeneous Large-Scale Linked Data Corpora PhD Thesis Digital Enterprise Research InstituteNational University of Ireland Galway 2011 Available from lthttpaidanhogancomdocsthesisgt

[30] Aidan Hogan Andreas Harth Stefan Decker Performing Object Consolidationon the Semantic Web Data Graph in First I3 Workshop Identity IdentifiersIdentification Workshop 2007

[31] Aidan Hogan Andreas Harth Axel Polleres Scalable Authoritative OWLReasoning for the Web Int J Semantic Web Inf Syst 5 (2) (2009) 49ndash90

[32] Aidan Hogan Andreas Harth Juumlrgen Umbrich Sheila Kinsella Axel PolleresStefan Decker Searching and browsing linked data with SWSE the SemanticWeb search engine J Web Sem 9 (4) (2011) 365ndash401

[33] Aidan Hogan Jeff Z Pan Axel Polleres Stefan Decker SAOR template ruleoptimisations for distributed reasoning over 1 billion linked data triples inInternational Semantic Web Conference 2010

[34] Aidan Hogan Axel Polleres Juumlrgen Umbrich Antoine Zimmermann Someentities are more equal than others statistical methods to consolidate LinkedData in Fourth International Workshop on New Forms of Reasoning for theSemantic Web Scalable and Dynamic (NeFoRS2010) 2010

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT, 2007, pp. 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW, 2010, pp. 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), 2009.

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop, 2008.

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR abs/1005.4298 (2010).

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC, 2010.

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference, 2011.

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR, 2007, pp. 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009), 2009.

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference, 2009, pp. 634–649.

[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Data corpora. We provide new evaluation over our 1.118 billion quadruple evaluation corpus, presenting new performance results over the full corpus and for a varying number of slave machines. We also analyse in depth the results of applying the techniques over our evaluation corpus, analysing the applicability of our methods in such scenarios. We contrast the results given by the baseline and extended consolidation approaches, and again manually evaluate the results of the extended consolidation approach for 1,000 pairs found to be coreferent.

5.1. High-level approach

In Table 2, we provide the pertinent rules for inferring new owl:sameAs relations from the data. However, after analysis of the data, we observed that no documents used the owl:maxQualifiedCardinality construct required for the cls-maxqc rules, and that only one document defined one owl:hasKey axiom,15 involving properties with less than five occurrences in the data; hence, we leave implementation of these rules for future work, and note that these new OWL 2 constructs have probably not yet had time to find proper traction on the Web. Thus, on top of inferencing involving explicit owl:sameAs, we are left with rule prp-fp, which supports the semantics of properties typed owl:FunctionalProperty; rule prp-ifp, which supports the semantics of properties typed owl:InverseFunctionalProperty; and rule cls-maxc2, which supports the semantics of classes with a specified cardinality of 1 for some defined property (a class-restricted version of the functional-property inferencing).16

Thus, we look at using the OWL 2 RL/RDF rules prp-fp, prp-ifp and cls-maxc2 for inferring new owl:sameAs relations between individuals; we also support an additional rule that gives an exact-cardinality version of cls-maxc2.17 Herein, we refer to these rules as consolidation rules.

We also investigate pre-applying more general OWL 2 RL/RDF reasoning over the corpus to derive more complete results, where the ruleset is available in Table 3 and is restricted to OWL 2 RL/RDF rules with one assertional pattern [33]; unlike the consolidation rules, these general rules do not require the computation of assertional joins and thus are amenable to execution by means of a single scan of the corpus. We have demonstrated this profile of reasoning to have good competency with respect to the features of RDFS and OWL used in Linked Data [29]. These general rules may indirectly lead to the inference of additional owl:sameAs relations, particularly when combined with the consolidation rules of Table 2.

Once we have used reasoning to derive novel owl:sameAs data, we then apply an on-disk batch-processing technique to close the equivalence relation and to ultimately consolidate the corpus.

Thus, our high-level approach is as follows:

(i) extract relevant terminological data from the corpus;
(ii) bind the terminological patterns in the rules from this data, thus creating a larger set of general rules with only one assertional pattern, and identifying assertional patterns that are useful for consolidation;
(iii) apply general rules over the corpus and buffer any input/inferred statements relevant for consolidation to a new file;
(iv) derive the closure of owl:sameAs statements from the consolidation-relevant dataset;
(v) apply consolidation over the main corpus with respect to the closed owl:sameAs data.

15 httphuemerlstadlernetrolerhrdf
16 Note that we have presented non-distributed batch-processing execution of these rules at a smaller scale (147 million quadruples) in [31].
17 Exact cardinalities are disallowed in OWL 2 RL due to their effect on the formal proposition of completeness underlying the profile, but such considerations are moot in our scenario.

In Step (i), we extract terminological data – required for application of our rules – from the main corpus. As discussed in Section 2.5, to help ensure robustness, we apply authoritative reasoning; in other words, at this stage we discard any third-party terminological statements that affect inferencing over instance data for remote classes and properties.

In Step (ii), we then compile the authoritative terminological statements into the rules to generate a set of T-ground (purely assertional) rules, which can then be applied over the entirety of the corpus [33]. As mentioned in Section 2.6, the T-ground general rules only contain one assertional pattern, and thus are amenable to execution by a single scan; e.g., given the statement

homepage rdfs:subPropertyOf isPrimaryTopicOf .

we would generate a T-ground rule

(?x, homepage, ?y) ⇒ (?x, isPrimaryTopicOf, ?y)

which does not require the computation of joins. We also bind the terminological patterns of the consolidation rules in Table 2; however, these rules cannot be performed by means of a single scan over the data since they contain multiple assertional patterns in the body. Thus, we instead extract triple patterns such as

(?x, isPrimaryTopicOf, ?y)

which may indicate data useful for consolidation (note that here isPrimaryTopicOf is assumed to be inverse-functional).
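To make the above concrete, the following is a minimal, illustrative sketch (in Python) of how authoritative rdfs:subPropertyOf axioms could be T-ground into assertional rules and applied in a single scan, while buffering consolidation-relevant statements. The tuple encoding and helper names are assumptions made for the example, not the system's actual data structures.

RDFS_SUBPROPERTY_OF = "rdfs:subPropertyOf"

def t_ground_subproperty(terminology):
    """Bind the terminological pattern of the sub-property rule, yielding a map
    {p: q} read as the T-ground rule (?x p ?y) => (?x q ?y)."""
    return {s: o for s, p, o in terminology if p == RDFS_SUBPROPERTY_OF}

def single_scan(assertional, rules, consolidation_preds):
    """One pass over the A-Box: apply T-ground rules (no joins needed) and
    buffer input/inferred statements matching a consolidation-relevant
    pattern (e.g. memberships of inverse-functional properties)."""
    inferred, relevant = [], []
    for s, p, o in assertional:
        candidates = [(s, p, o)]
        if p in rules:
            candidates.append((s, rules[p], o))
            inferred.append((s, rules[p], o))
        relevant += [t for t in candidates if t[1] in consolidation_preds]
    return inferred, relevant

terminology = [("foaf:homepage", RDFS_SUBPROPERTY_OF, "foaf:isPrimaryTopicOf")]
abox = [("ex:Axel", "foaf:homepage", "<http://polleres.net>")]
print(single_scan(abox, t_ground_subproperty(terminology), {"foaf:isPrimaryTopicOf"}))
# ([('ex:Axel', 'foaf:isPrimaryTopicOf', '<http://polleres.net>')],
#  [('ex:Axel', 'foaf:isPrimaryTopicOf', '<http://polleres.net>')])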

In Step (iii), we then apply the T-ground general rules over the corpus with a single scan. Any input or inferred statements matching a consolidation-relevant pattern – including any owl:sameAs statements found – are buffered to a file for later join computation. Subsequently, in Step (iv), we must compute the on-disk canonicalised closure of the owl:sameAs statements. In particular, we mainly employ the following three on-disk primitives:

(i) sequential scans of flat files containing line-delimited tuples;18
(ii) external-sorts, where batches of statements are sorted in memory, the sorted batches written to disk, and the sorted batches merged to the final output; and
(iii) merge-joins, where multiple sets of data are sorted according to their required join position and subsequently scanned in an interleaving manner that aligns on the join position, and where an in-memory join is applied for each individual join element.

Using these primitives to compute the owl:sameAs closure minimises the amount of main memory required, where we have presented similar approaches in [30,31].
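The following sketch illustrates the external-sort and merge-join primitives in Python: it writes sorted runs to temporary files and merges them lazily, and joins two pre-sorted streams by aligning on their join keys. This is an illustrative, single-machine approximation of the on-disk primitives described above (G-Zip compression and the tuple encoding are omitted).

import heapq
import itertools
import tempfile
from itertools import groupby

def external_sort(lines, batch_size=100_000):
    """Primitive (ii): sort an arbitrarily large stream of text lines by
    writing sorted batches ('runs') to disk and lazily merging them."""
    runs, it = [], iter(lines)
    while True:
        batch = sorted(itertools.islice(it, batch_size))
        if not batch:
            break
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(line + "\n" for line in batch)
        run.seek(0)
        runs.append((line.rstrip("\n") for line in run))
    return heapq.merge(*runs)

def merge_join(left, right, key_left, key_right):
    """Primitive (iii): join two streams sorted on their join position,
    applying an in-memory join for each individual join element."""
    gl, gr = groupby(left, key_left), groupby(right, key_right)
    kl, vl = next(gl, (None, None))
    kr, vr = next(gr, (None, None))
    while kl is not None and kr is not None:
        if kl == kr:
            for pair in itertools.product(list(vl), list(vr)):
                yield kl, pair
            kl, vl = next(gl, (None, None))
            kr, vr = next(gr, (None, None))
        elif kl < kr:
            kl, vl = next(gl, (None, None))
        else:
            kr, vr = next(gr, (None, None))

left = [("a", 1), ("b", 2)]
right = [("a", "x"), ("c", "y")]
print(list(merge_join(left, right, key_left=lambda t: t[0], key_right=lambda t: t[0])))
# [('a', (('a', 1), ('a', 'x')))]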

First, assertional memberships of functional properties, inverse-functional properties and cardinality restrictions (both properties and classes) are written to separate on-disk files. For functional-property and cardinality reasoning, a consistent join variable for the assertional patterns is given by the subject position; for inverse-functional-property reasoning, a join variable is given by the object position.19 Thus, we can sort the former sets of data according to subject and perform a merge-join by means of a linear scan; thereafter, the same procedure applies to the latter set, sorting and merge-joining on the object position. Applying merge-join scans, we produce new owl:sameAs statements.

18 These files are G-Zip-compressed flat files of N-Triple-like syntax, encoding arbitrary-length tuples of RDF constants.
19 Although a predicate-position join is also available, we prefer object-position joins that provide smaller batches of data for the in-memory join.

Both the originally asserted and newly inferred owl:sameAs relations are similarly written to an on-disk file, over which we now wish to perform the canonicalised symmetric/transitive closure. We apply a similar method again, leveraging external sorts and merge-joins to perform the computation (herein, we sketch the process and point the interested reader to [31, Section 4.6]). In the following, we use > and < to denote lexical ordering, SA as a shortcut for owl:sameAs, a, b, c, etc. to denote members of U ∪ B such that a < b < c, and define URIs to be lexically lower than blank nodes. The process is as follows:

(i) we only materialise symmetric equality relations that involve a (possibly intermediary) canonical term, chosen by a lexical ordering: given b SA a, we materialise a SA b; given a SA b and b SA c, we materialise the relations a SA b, a SA c and their inverses, but do not materialise b SA c or its inverse;
(ii) transitivity is supported by iterative merge-join scans:
– in the scan, if we find c SA a (sorted by object) and c SA d (sorted naturally), we infer a SA d and drop the non-canonical c SA d (and d SA c);
– at the end of the scan, newly inferred triples are marked and merge-joined into the main equality data; any triples echoing an earlier inference are ignored; dropped non-canonical statements are removed;
– the process is then iterative: in the next scan, if we find d SA a and d SA e, we infer a SA e and e SA a;
– inferences will only occur if they involve a statement added in the previous iteration, ensuring that inference steps are not re-computed and that the computation will terminate;
(iii) the above iterations stop when a fixpoint is reached and nothing new is inferred;
(iv) the process of reaching a fixpoint is accelerated using available main-memory to store a cache of partial equality chains.

The above steps follow well-known methods for transitive closure computation, modified to support the symmetry of owl:sameAs and to use adaptive canonicalisation in order to avoid quadratic output. With respect to the last item, we use Algorithm 1 to derive "batches" of in-memory equivalences, and when in-memory capacity is reached, we write these batches to disk and proceed with on-disk computation; this is particularly useful for computing the small number of long equality chains, which would otherwise require sorts and merge-joins over all of the canonical owl:sameAs data currently derived, and where the number of iterations would otherwise be the length of the longest chain. The result of this process is a set of canonicalised equality relations representing the symmetric/transitive closure.
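As a compact, purely in-memory stand-in for one batch of this procedure (the role played by Algorithm 1 before batches are spilled to disk), the following sketch computes a canonicalised symmetric/transitive closure with a union–find structure whose roots are always the lexically lowest members, so only pairs involving canonical terms are emitted. It is not the on-disk algorithm itself, just an illustration of the canonicalisation idea; names are illustrative.

def canonical_closure(same_as_pairs):
    """In-memory sketch: canonicalised symmetric/transitive closure of
    owl:sameAs pairs. Returns pairs (x, canon) with canon the lexically
    lowest member of x's equivalence class (URIs assumed lexically lower
    than blank nodes by the chosen term encoding)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return
        lo, hi = (ra, rb) if ra < rb else (rb, ra)
        parent[hi] = lo                     # keep the lexically lowest as root

    for a, b in same_as_pairs:
        union(a, b)
    return sorted((x, find(x)) for x in parent if find(x) != x)

pairs = [("ex:b", "ex:a"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
print(canonical_closure(pairs))
# [('ex:b', 'ex:a'), ('ex:c', 'ex:a'), ('ex:e', 'ex:d')]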

Next, we briefly describe the process of canonicalising data with respect to this on-disk equality closure, where we again use external sorts and merge-joins. First, we prune the owl:sameAs index to only maintain relations s1 SA s2 such that s1 > s2: thus, given s1 SA s2, we know that s2 is the canonical identifier and that s1 is to be rewritten. We then sort the data according to the position that we wish to rewrite, and perform a merge-join over both sets of data, buffering the canonicalised data to an output file. If we want to rewrite multiple positions of a file of tuples (e.g., subject and object), we must rewrite one position, sort the results by the second position, and then rewrite the second position.20

20 One could instead build an on-disk map for equivalence classes and pivot elements and follow a consolidation method similar to the previous section over the unordered data; however, we would expect such an on-disk index to have a low cache hit-rate given the nature of the data, which would lead to many (slow) disk seeks. An alternative approach might be to hash-partition the corpus and equality index over different machines; however, this would require a non-trivial minimum amount of memory on the cluster.
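A simplified sketch of the rewrite itself is given below. Here the closed, canonicalised equality relation is held as an ordinary dictionary for brevity, whereas the described approach keeps it on disk and applies it via sorts and merge-joins; the quad encoding is an assumption.

def canonicalise_corpus(quads, canon):
    """Rewrite a corpus of (s, p, o, context) quads against the closed
    owl:sameAs data: rewrite subjects, re-sort by object, then rewrite
    objects (leaving objects of rdf:type triples untouched)."""
    RDF_TYPE = "rdf:type"
    # first pass: sorted by subject, rewrite the subject position
    pass1 = [(canon.get(s, s), p, o, c)
             for s, p, o, c in sorted(quads, key=lambda q: q[0])]
    # second pass: sorted by object, rewrite the object position
    return [(s, p, o if p == RDF_TYPE else canon.get(o, o), c)
            for s, p, o, c in sorted(pass1, key=lambda q: q[2])]

canon = {"ex:b": "ex:a"}
quads = [("ex:b", "foaf:knows", "ex:b", "http://example.org/doc")]
print(canonicalise_corpus(quads, canon))
# [('ex:a', 'foaf:knows', 'ex:a', 'http://example.org/doc')]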

Finally, note that in the derivation of owl:sameAs from the consolidation rules prp-fp, prp-ifp and cls-maxc2, the overall process may be iterative. For example, consider:

dblp:Axel foaf:isPrimaryTopicOf <http://polleres.net> .
axel:me foaf:isPrimaryTopicOf <http://axel.deri.ie> .
<http://polleres.net> owl:sameAs <http://axel.deri.ie> .

from which the conclusion that dblp:Axel is the same as axel:me holds. We see that new owl:sameAs relations (either asserted or derived from the consolidation rules) may in turn "align" terms in the join position of the consolidation rules, leading to new equivalences. Thus, for deriving the final owl:sameAs, we require a higher-level iterative process as follows:

(i) initially apply the consolidation rules, and append the results to a file alongside the asserted owl:sameAs statements found;
(ii) apply the initial closure of the owl:sameAs data;
(iii) then, iteratively until no new owl:sameAs inferences are found:
– rewrite the join positions of the on-disk files containing the data for each consolidation rule according to the current owl:sameAs data;
– derive new owl:sameAs inferences possible through the previous rewriting for each consolidation rule;
– re-derive the closure of the owl:sameAs data including the new inferences.

The final closed file of owl:sameAs data can then be re-used to rewrite the main corpus in two sorts and merge-join scans over subject and object.
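The outer iteration can be summarised by the following sketch, which repeatedly applies one consolidation rule (only prp-ifp, for brevity) over join values canonicalised by the current equalities, re-closing until a fixpoint is reached. Here 'close' stands in for the canonicalised closure procedure (e.g., the canonical_closure sketch above), and the data structures are illustrative assumptions.

def ifp_rule(relevant, canon):
    """prp-ifp: subjects sharing a (p, o) pair, for p inverse-functional,
    are equivalent; o is first rewritten to its canonical identifier."""
    groups = {}
    for s, p, o in relevant:
        groups.setdefault((p, canon.get(o, o)), set()).add(canon.get(s, s))
    pairs = set()
    for members in groups.values():
        lowest = min(members)
        pairs.update((m, lowest) for m in members if m != lowest)
    return pairs

def consolidate_fixpoint(asserted_same_as, relevant, close):
    """Iterate rule application and closure until no new owl:sameAs appear."""
    canon = dict(close(set(asserted_same_as) | ifp_rule(relevant, {})))
    while True:
        new_canon = dict(close(set(asserted_same_as) | ifp_rule(relevant, canon)))
        if new_canon == canon:
            return canon
        canon = new_canon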

5.2. Distributed approach

The distributed approach follows quite naturally from the previous discussion. As before, we assume that the input data are evenly pre-distributed over the slave machines (in any arbitrary ordering), where we can then apply the following process:

(i) Run: scan the distributed corpus (split over the slave machines) in parallel to extract relevant terminological knowledge;
(ii) Gather: gather terminological data onto the master machine, and thereafter bind the terminological patterns of the general/consolidation rules;
(iii) Flood: flood the rules for reasoning and the consolidation-relevant patterns to all slave machines;
(iv) Run: apply reasoning and extract consolidation-relevant statements from the input and inferred data;
(v) Gather: gather all consolidation statements onto the master machine; then, in parallel:
– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine;
– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c);
(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed;
(vii) Run: each slave machine then rewrites the subjects of its segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.

Table 8
Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently. Columns 1, 2, 4 and 8 give results for that many slave machines over the 100 million quadruple corpus; Full-8 gives results for eight slave machines over the full corpus. Each cell gives minutes / % of total execution time.

Category                                      1               2               4               8              Full-8
Master
  Total execution time                        454.3 / 100.0   230.2 / 100.0   127.5 / 100.0   81.3 / 100.0   740.4 / 100.0
  Executing                                   29.9 / 6.6      30.3 / 13.1     32.5 / 25.5     30.1 / 37.0    243.6 / 32.9
    Aggregate Consolidation Relevant Data     4.3 / 0.9       4.5 / 2.0       4.7 / 3.7       4.6 / 5.7      29.9 / 4.0
    Closing owl:sameAs *                      24.4 / 5.4      24.5 / 10.6     25.4 / 19.9     24.4 / 30.0    211.2 / 28.5
    Miscellaneous                             1.2 / 0.3       1.3 / 0.5       2.4 / 1.9       1.1 / 1.3      2.5 / 0.3
  Idle (waiting for slaves)                   424.5 / 93.4    199.9 / 86.9    95.0 / 74.5     51.2 / 63.0    496.8 / 67.1
Slave
  Avg. Executing (total)                      448.9 / 98.8    221.2 / 96.1    111.6 / 87.5    56.5 / 69.4    599.1 / 80.9
    Extract Terminology                       44.4 / 9.8      22.7 / 9.9      11.8 / 9.2      6.9 / 8.4      49.4 / 6.7
    Extract Consolidation Relevant Data       95.5 / 21.0     48.3 / 21.0     24.3 / 19.1     13.2 / 16.2    137.9 / 18.6
    Initial Sort (by subject) *               98.0 / 21.6     48.3 / 21.0     23.7 / 18.6     11.6 / 14.2    133.8 / 18.1
    Consolidation                             211.0 / 46.4    101.9 / 44.3    51.9 / 40.7     24.9 / 30.6    278.0 / 37.5
  Avg. Idle                                   5.5 / 1.2       8.9 / 3.9       37.9 / 29.7     23.6 / 29.0    141.3 / 19.1
    Waiting for peers                         0.0 / 0.0       3.2 / 1.4       7.1 / 5.6       6.3 / 7.7      37.5 / 5.1
    Waiting for master                        5.5 / 1.2       5.8 / 2.5       30.8 / 24.1     17.3 / 21.2    103.8 / 14.0


5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 2.5 min (1.8%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible – the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time – of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup – in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards to performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements – this time directly extracted from explicit statements, as opposed to explicit and inferred statements – which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine, and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

Table 9
Top ten most frequently occurring blacklisted values.

#    Blacklisted term                                   Occurrences
1    Empty literals                                     584735
2    <http://null>                                      414088
3    <http://www.vox.com/gone>                          150402
4    "08445a31a78661b5c746feff39a9db6e4e2cc5cf"         58338
5    <http://www.facebook.com/>                         6988
6    <http://facebook.com>                              5462
7    <http://www.google.com>                            2234
8    <http://www.facebook.com>                          1254
9    <http://google.com>                                1108
10   <http://null.com>                                  542

Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log). (y-axis: number of classes; x-axis: equivalence class size; series: baseline consolidation, reasoning consolidation.)

Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties, and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties, and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments for older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9 – this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal 08445a31a78661b5c746feff39a9db6e4e2cc5cf is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
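A trivial sketch of the blacklisting step is shown below: consolidation-relevant statements whose shared value appears on the blacklist are dropped before the join computation. The entries shown are a small subset taken from Table 9 (the full list of forty-one values is maintained externally, cf. footnote 23), and the tuple encoding is again an assumption.

# A subset of blacklisted values from Table 9: the empty literal, the
# SHA-1 hash of "mailto:", and common junk homepage values.
BLACKLIST = {
    '""',
    '"08445a31a78661b5c746feff39a9db6e4e2cc5cf"',
    "<http://null>",
    "<http://null.com>",
}

def drop_blacklisted(relevant):
    """Filter consolidation-relevant (s, p, o) statements whose value is
    blacklisted, so they cannot trigger (incorrect) owl:sameAs inferences."""
    for s, p, o in relevant:
        if o not in BLACKLIST and s not in BLACKLIST:
            yield (s, p, o)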

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 SA s2, s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

22 cf. http://bio2rdf.org/ns/bio2rdf:Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation), mentioning a total of 14.86 million terms (an increase of 2.58× from baseline – 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× from baseline – 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline – 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user – labelled Team Vox – commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford – labelled IBRG – where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time, we selected SAME: 823 (82.3%); TRIVIALLY SAME: 145 (14.5%); DIFFERENT: 23 (2.3%); and UNCLEAR: 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

#   Canonical term (lexically lowest in equivalence class)            Size     Correct
1   bnode37@http://a12iggymom.vox.com/profile/foaf.rdf                33,052   ✓
2   http://bio2rdf.org/dailymed_drugs:1000                            8,481
3   http://ajft.org/rdf/foaf.rdf#_me                                   8,140    ✓
4   bnode4@http://174.129.12.140:8080/tcm/data/association/100        4,395    ✓
5   bnode1@http://aaa977.vox.com/profile/foaf.rdf                     1,977    ✓

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log). (y-axis: number of classes; x-axis: number of PLDs in equivalence class.)

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

#   Canonical identifier                             BL    R
1   <http://dblp.l3s.de/Tim_Berners-Lee>             26    50
2   <genid:danbri>                                   10    138
3   <http://update.status.net/>                      0     0
4   <http://www.ldodds.com/foaf/foaf-a-matic>        0     0
5   <http://update.status.net/user/1#acct>           0     6

Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated, and reasoning consolidated data (log/log). (y-axis: number of terms; x-axis: number of PLDs mentioning a blank-node/URI term.)

to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

25 We make these and later results available for review at http://swse.deri.org/entity.

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley" – one of the founders and leading contributors of the FOAF project – with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

26 Domenico Gendarmi, with three URIs – one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; and (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity – in terms of unique PLDs – was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks, or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data; the core premise of this investigation was that very similar entities – with higher concurrence scores – are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;
– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].


However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic, and domain agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities – if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);
(ii) the concurrence measure should give a higher weight to exclusive shared pairs – pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;
(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities – a shared pair cannot reduce the measured concurrence of the sharing entities;
(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;
(v) correlation may exist between shared pairs – e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers) – where we wish to dampen the cumulative effect of correlation in the concurrence analysis;
(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between three colleagues ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p,s), is the cardinality of the set {o ∈ C | (s,p,o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p,o), is the cardinality of the set {s ∈ U ∪ B | (s,p,o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_{G_ex}(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph – e.g., ICard_{G_ex}(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores – note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
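Both statistics can be tallied in a single pass over the graph; the following self-contained sketch does so for the running example (the triple encoding used here is an assumption made for illustration).

from collections import defaultdict

G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com>"),
]

def cardinalities(graph):
    """Card_G(p,s): distinct objects per (p,s); ICard_G(p,o): distinct subjects per (p,o)."""
    card, icard = defaultdict(set), defaultdict(set)
    for s, p, o in graph:
        card[(p, s)].add(o)
        icard[(p, o)].add(s)
    return ({k: len(v) for k, v in card.items()},
            {k: len(v) for k, v in icard.items()})

card, icard = cardinalities(G_ex)
print(card[("foaf:maker", "dblp:AliceB10")])    # 2
print(icard[("foaf:gender", '"female"')])       # 2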

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality) – we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

\[ \mathit{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s \,.\, (s,p,o) \in G\}|} \]

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater than or equal to one: the numerator is, by definition, greater than or equal to the denominator. Taking an example, AIC_{G_ex}(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties – in their respective directions – as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.
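Continuing the previous sketch (reusing G_ex, card and icard from it), the average (inverse-)cardinality of a property is simply the mean of the per-pair counts; on the example graph this reproduces AIC(foaf:gender) = 1.5.

def aic(icard, p):
    """AIC_G(p): mean of the non-zero inverse-cardinalities of p, i.e. the
    number of (s, o) pairs with predicate p over the number of distinct o."""
    counts = [n for (q, _o), n in icard.items() if q == p]
    return sum(counts) / len(counts)

def ac(card, p):
    """AC_G(p): the analogous average cardinality (switching directionality)."""
    counts = [n for (q, _s), n in card.items() if q == p]
    return sum(counts) / len(counts)

print(aic(icard, "foaf:gender"))    # 1.5  ("male": 1, "female": 2)
print(ac(card, "foaf:maker"))       # 2.0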

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values for properties with a low number of observations towards a global mean – consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula to weight properties with few observations towards a global mean, as follows:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

\[ \mathit{AAIC}_G(p) = \frac{\mathit{AIC}_G(p) \cdot |G_p| + \overline{\mathit{AIC}_G} \cdot \overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1) \]

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), \(\overline{\mathit{AIC}_G}\) is the average inverse-cardinality for all predicate–object pairs (formally, \(\overline{\mathit{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s \,.\, (s,p,o) \in G\}|}\)), and \(\overline{|G|}\) is the average number of distinct objects for all predicates in the graph (formally, \(\overline{|G|} = \frac{|\{(p,o) \mid \exists s \,.\, (s,p,o) \in G\}|}{|\{p \mid \exists s \exists o \,.\, (s,p,o) \in G\}|}\)).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that AIC_G(p)·|G_p| = |{(s,o) | (s,p,o) ∈ G}| also denotes the number of triples for which p appears as a predicate in graph G, and that \overline{\mathrm{AIC}_G}·|\overline{G}| = |G| / |{p | ∃s∃o : (s,p,o) ∈ G}| also denotes the average number of triples per predicate. We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (\overline{\mathrm{AIC}_G}) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (|\overline{G}|).
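As a small illustration of Eq. (1), the following Python sketch (ours; the variable names and example numbers are invented) dampens a sparsely observed predicate's AIC towards the global mean:

def aaic(aic_p, n_objects_p, mean_aic, mean_n_objects):
    # Eq. (1): a credibility-weighted average of the predicate's own reading
    # (aic_p, derived from n_objects_p distinct objects) and the global mean reading
    return (aic_p * n_objects_p + mean_aic * mean_n_objects) / (n_objects_p + mean_n_objects)

# a predicate observed with only 2 distinct objects barely moves the global mean...
print(aaic(2.0, 2, 1.2, 1000))       # ~1.2016
# ...whereas a well-observed predicate keeps (most of) its own reading
print(aaic(2.0, 100000, 1.2, 1000))  # ~1.992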

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company—e.g., http://google.com—should not gain as much concurrence as entities that share a value referential to a smaller company—e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country and for which we find many observations in our corpus, returning a high AAIC value, and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

28. Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what's signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts: for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p, s) with respect to a graph G is given as:

\[ \mathrm{C}_G(p,s) = \frac{1}{\mathrm{Card}_G(p,s) \cdot \mathrm{AAC}_G(p)} \]

and the concurrence-coefficient of a predicate–object pair (p, o) with respect to a graph G is analogously given as:

\[ \mathrm{IC}_G(p,o) = \frac{1}{\mathrm{ICard}_G(p,o) \cdot \mathrm{AAIC}_G(p)} \]

Again, please note that these coefficients fall into the interval ]0,1], since the denominators are, by definition, necessarily greater than or equal to one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2000; now, IC_G(p_wh, o_g) = 1/(2000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie such that ICard_G(p_wh, o_d) = 100; now, IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, …, z_n) be a tuple such that for each i = 1…n, z_i ∈ ]0,1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1…n, ACS_k = z_k + ACS_{k−1} − z_k · ACS_{k−1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring—P(A ∨ B) = P(A) + P(B) − P(A)·P(B)—which is, by definition, commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
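A direct rendering of Definition 6 in Python (a sketch; the input coefficients below are arbitrary):

def acs(coefficients):
    # probabilistic-sum aggregation: ACS_k = z_k + ACS_{k-1} - z_k * ACS_{k-1};
    # commutative and associative, so the input order does not matter
    value = 0.0
    for z in coefficients:
        value = z + value - z * value
    return value

print(acs([0.00143, 0.00143, 0.00007]))  # ~0.00293, always stays within [0, 1)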

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples \mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}, where each element Z_{p_i} represents the tuple of (non-pruned) coefficients generated for the predicate p_i (for brevity, we omit the graph subscript). We then aggregate this bag as follows:

\[ \mathrm{ACS}(\mathcal{Z}) = \mathrm{ACS}\Big(\big(\mathrm{ACS}(Z_{p_i} \cdot \mathrm{AA[I]C}(p_i)) \,/\, \mathrm{AA[I]C}(p_i)\big)_{Z_{p_i} \in \mathcal{Z}}\Big) \]

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed, and where Z_{p_i} · AA[I]C(p_i) denotes scaling each coefficient in Z_{p_i} by AA[I]C(p_i). Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.
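The per-predicate aggregation can then be sketched in Python as follows (ours, and following the reading of the formula reconstructed above: coefficients of one predicate are first aggregated on the 1/Card scale, and the result is bounded by the inverse of the predicate's AA[I]C value):

def acs(coefficients):
    value = 0.0
    for z in coefficients:
        value = z + value - z * value
    return value

def concurrence(shared, aaic):
    # shared: predicate -> list of pair coefficients (Definition 5) for the
    #         values shared by the two entities; aaic: predicate -> AA[I]C value
    contributions = []
    for p, coeffs in shared.items():
        scaled = [c * aaic[p] for c in coeffs]       # back to 1/Card values in ]0,1]
        contributions.append(acs(scaled) / aaic[p])  # bounded by 1/AA[I]C(p)
    return acs(contributions)

# two shared foaf:made papers contribute more than one, but less than twice as much
print(concurrence({"foaf:made": [0.02, 0.02]}, {"foaf:made": 5.0}))  # 0.038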

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory—if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c) and then scanning the data, computing the cardinality (number of distinct objects) for each (s, p) pair and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p, o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
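The scan itself can be sketched as follows (ours; it assumes the quadruples arrive as Python tuples already in (s, p, o, c) sorted order, and the inverse direction is obtained by running the same code over (o, p, s, c)-sorted data):

from itertools import groupby
from collections import defaultdict

def average_cardinalities(sorted_quads):
    # sorted_quads: iterable of (s, p, o, c) tuples in natural sorted order
    keys = defaultdict(int)     # p -> number of distinct (s, p) keys
    triples = defaultdict(int)  # p -> number of distinct (s, p, o) triples
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        triples[p] += len({o for _, _, o, _ in group})
        keys[p] += 1
    # average cardinality per predicate; the credibility adjustment of Eq. (1)
    # would be applied to these averages afterwards
    return {p: triples[p] / keys[p] for p in keys}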

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s, p) pair, for each set of unique objects O_ps found thereon, and for each pair in

{(o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_ps, o_i < o_j}

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p,s), p, s, −)

where C(p,s) = 1 / (|O_ps| · AAC(p)). We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects S_po, and for each pair in

{(s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_po, s_i < s_j}

we output analogous sextuples of the form:

(s_i, s_j, IC(p,o), p, o, +)

We call the sets O_ps and their analogues S_po concurrence classes, denoting sets of entities that share the given predicate–subject / predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements—generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
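A simplified sketch of this final aggregation step (ours): it assumes the raw sextuples from both directions have been concatenated and sorted by their first two elements, and it omits the same-value pruning and the per-predicate bound discussed in Section 6.1.1 for brevity:

from itertools import groupby

def aggregate(sorted_sextuples):
    # sorted_sextuples: iterable of (e_i, e_j, coefficient, p, value, direction)
    # tuples sorted by (e_i, e_j); yields one (e_i, e_j, score) per entity pair
    for (ei, ej), group in groupby(sorted_sextuples, key=lambda t: (t[0], t[1])):
        score = 0.0
        for _, _, coefficient, _, _, _ in group:
            score = coefficient + score - coefficient * score
        yield ei, ej, score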

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets—for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate (10^6 × (10^6 − 1))/2 ≈ 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set—e.g., |S_po|—is a factor in the denominator of the computed C(p,o) score, such that C(p,o) ≤ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38—we will see more in Section 6.4.

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash-function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object—triples with rdf:type as predicate are not included in the object-hashing; subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12. Breakdown of timing of distributed concurrence analysis.

Category                                   1 machine        2 machines       4 machines       8 machines       Full-8
                                           min      %       min      %       min      %       min      %       min      %
Master
  Total execution time                     516.2  100.0     262.7  100.0     137.8  100.0      73.0  100.0     835.4  100.0
  Executing                                  2.2    0.4       2.3    0.9       3.1    2.3       3.4    4.6       2.1    0.3
    Miscellaneous                            2.2    0.4       2.3    0.9       3.1    2.3       3.4    4.6       2.1    0.3
  Idle (waiting for slaves)                514.0   99.6     260.4   99.1     134.6   97.7      69.6   95.4     833.3   99.7
Slave
  Avg. executing (total exc. idle)         514.0   99.6     257.8   98.1     131.1   95.1      66.4   91.0     786.6   94.2
    Split/sort/scatter (subject)           119.6   23.2      59.3   22.6      29.9   21.7      14.9   20.4     175.9   21.1
    Merge-sort (subject)                    42.1    8.2      20.7    7.9      11.2    8.1       5.7    7.9      62.4    7.5
    Split/sort/scatter (object)            115.4   22.4      56.5   21.5      28.8   20.9      14.3   19.6     164.0   19.6
    Merge-sort (object)                     37.2    7.2      18.7    7.1       9.6    7.0       5.1    7.0      55.6    6.6
    Extract high-level statistics           20.8    4.0      10.1    3.8       5.1    3.7       2.7    3.7      28.4    3.3
    Generate raw concurrence tuples         45.4    8.8      23.3    8.9      11.6    8.4       5.9    8.0      67.7    8.1
    Coordinate/sort concurrence tuples      97.6   18.9      50.1   19.1      25.0   18.1      12.5   17.2     166.0   19.9
    Merge-sort/aggregate similarity         36.0    7.0      19.2    7.3      10.0    7.2       5.3    7.2      66.6    8.0
  Avg. idle                                  2.2    0.4       4.9    1.9       6.7    4.9       6.5    9.0      48.8    5.8
    Waiting for peers                        0.0    0.0       2.6    1.0       3.6    2.6       3.2    4.3      46.7    5.6
    Waiting for master                       2.2    0.4       2.3    0.9       3.1    2.3       3.4    4.6       2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step—in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.
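The coordinate step can be pictured with the following sketch (ours; a stand-in for the actual remote scatter, which sends each segment over the network rather than returning it):

def coordinate(quads, n_slaves, position=0):
    # assign each quad to a peer by a modulo-hash on the term at the given
    # position (0 = subject, 2 = object), so that all quads sharing that term
    # end up on the same slave; each segment would then be sorted and sent
    segments = [[] for _ in range(n_slaves)]
    for quad in quads:
        segments[hash(quad[position]) % n_slaves].append(quad)
    return segments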

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to uneven distribution of data, and subsequently uneven load-balancing—e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1970—even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus: although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal; a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks—again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples: we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power–law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static—i.e., larger class sizes give quadratically more concurrences, but occur polynomially less often—until the point where the largest classes, which generally only have one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power–law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs (a) and predicate–subject pairs (b), respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i · (s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13. Top five predicates with respect to lowest adjusted average cardinality (AAC).

#  Predicate              Objects       Triples       AAC
1  foaf:nick              150,433,035   150,437,864   1.000
2  lldpubmed:journal        6,790,426     6,795,285   1.003
3  rdf:value                2,802,998     2,803,021   1.005
4  eurostat:geo             2,642,034     2,642,034   1.005
5  eurostat:time            2,641,532     2,641,532   1.005

Table 14. Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

#  Predicate                      Subjects    Triples     AAIC
1  lldpubmed:meshHeading          2,121,384   2,121,384   1.009
2  opiumfield:recommendation      1,920,992   1,920,992   1.010
3  fb:type.object.key             1,108,187   1,108,187   1.017
4  foaf:page                      1,702,704   1,712,970   1.017
5  skipinions:hasFeature          1,010,604   1,010,604   1.019

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair contained 79 thousand entities, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, …). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000) we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996)—the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999)—almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

30. cf. http://api.hi5.com/rest/profile/foaf/100614697
31. cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for an intra-PLD pair—although fewer inter-PLD concurrences are found, they typically have higher concurrences.32

32. Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

Table 15. Top five concurrent entities and the number of pairs they share.

#  Entity label 1    Entity label 2    Concur.
1  New York City     New York State    791
2  London            England           894
3  Tokyo             Japan             900
4  Toronto           Ontario           418
5  Philadelphia      Pennsylvania      217

Table 16. Breakdown of concurrences for top five ranked entities in SWSE, ordered by rank, with, respectively, entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

#  Ranked entity      Con.   "Closest" entity    Val.
1  Tim Berners-Lee     908   Lalana Kagal        0.83
2  Dan Brickley       2552   Libby Miller        0.94
3  updatestatus.net     11   socialnetwork.ro    0.45
4  FOAF-a-matic         21   foaf.me             0.23
5  Evan Prodromou     3367   Stav Prodromou      0.89

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share—the concurrence score for each of these pairs was > 0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters—properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities—for brevity, we again show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

33. Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

7. Entity disambiguation and repair

We have already seen that—even by only exploiting the formal logical consequences of the data through reasoning—consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class—breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 lead to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

34. We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

35. We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

36. http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37. http://ontologydesignpatterns.org/cp/owl/fsdas/

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length—a functional property—and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
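The relaxed comparison can be sketched as follows (ours; literals are assumed to be given as (lexical form, language tag or None, datatype or None) tuples, which is an assumption of this sketch rather than of the paper):

def literals_conflict(l1, l2):
    # flag an inconsistency for a functional datatype property only if
    # (i) the two literals are not equal as given, and
    # (ii) their lower-cased lexical forms (ignoring language tags and
    #      datatypes) also differ
    if l1 == l2:
        return False
    return l1[0].lower() != l2[0].lower()

print(literals_conflict(("Male", "en", None), ("male", None, None)))    # False: not flagged
print(literals_conflict(("male", None, None), ("female", None, None)))  # True: inconsistent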

With respect to these consistency-checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

38. In any case, we always source terminological data from the raw, unconsolidated corpus.

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns—particularly in rule eq-diff1—are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41. http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

42. Note that this could also be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.




With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers—what we term incompatible identifiers—such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
ex:made owl:propertyDisjointWith ex:maker .

Assertional [fictional]:
ex:AZ′ ex:maker ex:entcons′ .
dblp:Antoine_Zimmermann′ ex:made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

43. We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject—loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency—i.e., an inconsistency derivable by the alignment of incompatible identifiers—we first derive a collection of sets C = {C_1, …, C_n}, C ⊆ 2^{U∪B}, such that each C_i ∈ C, C_i ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward, greedy, consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq_0, each containing an individual identifier in the equivalence class. Now, let Φ(E_i, E_j) denote the aggregated weight of the edge considering the merge of the nodes of E_i and the nodes of E_j in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in E_i and all nodes in E_j, and such that c denotes the concurrence score considering the merge of entities in E_i and E_j—intuitively, the same weight as before, but applied as if the identifiers in E_i and E_j were consolidated in the graph. We can apply the following clustering:

– for each pair of sets E_i, E_j ∈ Eq_n such that ∄{a, b} ∈ C with a ∈ E_i, b ∈ E_j (i.e., consistently mergeable subsets), identify the weights of Φ(E_i, E_j) and order the pairings;

– in descending order with respect to the above weights, merge E_i, E_j pairs—such that neither E_i nor E_j have already been merged in this iteration—producing Eq_{n+1} at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
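A condensed Python sketch of this repair loop (ours; it assumes the incompatible pairs and an aggregated edge-weight function Φ are already available, and it breaks ties arbitrarily rather than via concurrence, so it is illustrative only):

def repair(identifiers, incompatible, weight):
    # identifiers: the members of the inconsistent equivalence class
    # incompatible: set of frozensets {a, b} that must end up in different partitions
    # weight(A, B): aggregated evidence pair (d, c) for merging clusters A and B
    clusters = [frozenset([i]) for i in identifiers]

    def mergeable(a, b):
        return not any((x in a and y in b) or (x in b and y in a)
                       for x, y in (tuple(pair) for pair in incompatible))

    changed = True
    while changed:
        changed = False
        candidates = [(weight(a, b), a, b)
                      for i, a in enumerate(clusters) for b in clusters[i + 1:]
                      if mergeable(a, b) and weight(a, b) > (0, 0.0)]
        merged = set()
        # merge the most strongly linked, consistently mergeable pairs first
        for w, a, b in sorted(candidates, key=lambda t: t[0], reverse=True):
            if a in merged or b in merged:
                continue
            clusters.remove(a)
            clusters.remove(b)
            clusters.append(a | b)
            merged.update((a, b))
            changed = True
    return clusters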

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data and batch consolidated segments of data.

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated—if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17. Breakdown of timing of distributed disambiguation and repair.

Category                                  1 machine       2 machines      4 machines      8 machines      Full-8
                                          min      %      min      %      min      %      min      %      min      %
Master
  Total execution time                    114.7  100.0     61.3  100.0     36.7  100.0     23.8  100.0    141.1  100.0
  Executing (miscellaneous)                 0.0    0.0      0.1    0.1      0.1    0.1      0.1    0.1      0.0    0.0
  Idle (waiting for slaves)               114.7  100.0     61.2   99.9     36.6   99.9     23.8   99.9    141.1  100.0
Slave
  Avg. executing (total exc. idle)        114.7  100.0     59.0   96.3     33.6   91.5     22.3   93.7    130.9   92.8
    Identify inconsistencies and repairs   74.7   65.1     38.5   62.8     23.2   63.4     17.2   72.2     72.4   51.3
    Repair corpus                          40.0   34.8     20.5   33.5     10.3   28.1      5.1   21.5     58.5   41.5
  Avg. idle                                 0.0    0.0      2.3    3.7      3.1    8.5      1.5    6.3     10.2    7.2
    Waiting for peers                       0.0    0.0      2.2    3.7      3.1    8.4      1.5    6.2     10.1    7.2
    Waiting for master                      0.0    0.0      0.1    0.1      0.0    0.1      0.0    0.1      0.1    0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair; we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent executing on the master machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage—and the associated load-balancing issues—would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%)—containing 3985 identifiers—causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions.

We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint-class pairs in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

Table 18. Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

#  Functional property                    Detections
1  foaf:gender                            60
2  foaf:age                               32
3  dbo:height, dbo:length                  4
4  dbo:wheelbase, dbo:width                3
5  atomowl:body                            1
6  loc:address, loc:name, loc:phone        1

Table 19. Breakdown of inconsistency detections for disjoint classes.

#  Disjoint class 1     Disjoint class 2    Detections
1  foaf:Document        foaf:Person         163
2  foaf:Document        foaf:Agent          153
3  foaf:Organization    foaf:Person          17
4  foaf:Person          foaf:Project          3
5  foaf:Organization    foaf:Project          1

Table 20. Results of manual repair inspection.

Manual       Auto         Count     %
SAME         *            223       44.3
DIFFERENT    *            261       51.9
UNCLEAR      –             19        3.8
*            SAME         361       74.6
*            DIFFERENT    123       25.4
SAME         SAME         205       91.9
SAME         DIFFERENT     18        8.1
DIFFERENT    DIFFERENT    105       40.2
DIFFERENT    SAME         156       59.7

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).

had to be completely disbanded since all identifiers were pairwiseincompatible A total of 905 partitions were created during the re-pair with each of the 280 classes being broken up into an averageof 323 repaired consistent partitions Fig 10 gives the distributionof the number of repaired partitions created for each equivalenceclass 230 classes were broken into the minimum of two partitionsand one class was broken into 96 partitions Fig 11 gives the dis-tribution of the number of identfiers in each repaired partitioneach partition contained an average of 44 equivalent identifierswith 577 identifiers in singleton partitions and one partition con-taining 182 identifiers

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

44. We were particularly careful to distinguish information resources—such as Wikipedia articles—and non-information resources as being DIFFERENT.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
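For reference, these scores follow directly from the counts in Table 20; for the re-aligned (SAME) case, for instance:

\[
P = \frac{205}{205 + 156} \approx 0.568, \qquad
R = \frac{205}{205 + 18} \approx 0.919, \qquad
F_1 = \frac{2PR}{P + R} \approx 0.702.
\]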

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but which were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.
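To illustrate the point (a minimal sketch over hypothetical ISBN data, not part of our system): a property that is both functional and inverse-functional supports both directions of inference, since shared values license merging identifiers, while distinct values within an (incorrectly) consolidated group pinpoint exactly which identifiers to separate.

# Sketch: a one-to-one datatype-property (here a hypothetical isbn) used both to
# merge identifiers and to partition an incorrect equivalence class for repair.
from collections import defaultdict

statements = [
    ("ex:bookA", "isbn", "978-0-123"),
    ("ex:bookB", "isbn", "978-0-123"),   # same value: same entity (inverse-functional)
    ("ex:bookC", "isbn", "978-0-456"),   # different value: different entity (functional)
]

# Merging: group identifiers that share a value of the inverse-functional property.
by_value = defaultdict(set)
for s, p, o in statements:
    if p == "isbn":
        by_value[o].add(s)
coreferent = [ids for ids in by_value.values() if len(ids) > 1]
print(coreferent)  # one group containing ex:bookA and ex:bookB

# Repair: within a (possibly incorrect) equivalence class, partition its members
# by their value for the functional property; distinct values must be kept apart.
equivalence_class = {"ex:bookA", "ex:bookB", "ex:bookC"}
partitions = defaultdict(set)
for s, p, o in statements:
    if s in equivalence_class and p == "isbn":
        partitions[o].add(s)
print(list(partitions.values()))  # bookA and bookB stay together; bookC is separated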

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external-sorts, characterised by the linearithmic time complexity O(n·log(n)); external-sorts do not have a critical main-memory requirement and are efficiently distributable (a generic sketch of this primitive is given after the following list). Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].
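A generic sketch of the external-sort primitive referred to above (simplified, and not our actual implementation) is as follows: the input is sorted in fixed-size batches that fit in main memory, each batch is written to disk as a sorted run, and the runs are then merged lazily in a single linear pass. Sorting quadruples by subject, for example, amounts to serialising them one per line with the subject first.

# Generic external merge-sort sketch: sort fixed-size in-memory batches to disk
# as runs, then merge the sorted runs lazily in one linear pass.
# Assumes each input line is newline-terminated (as when iterating over a file).
import heapq
import tempfile

def _write_run(sorted_lines):
    run = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    run.writelines(sorted_lines)
    run.close()
    return run.name

def external_sort(lines, batch_size=1_000_000):
    runs, batch = [], []
    for line in lines:
        batch.append(line)
        if len(batch) >= batch_size:
            runs.append(_write_run(sorted(batch)))
            batch = []
    if batch:
        runs.append(_write_run(sorted(batch)))
    # heapq.merge consumes the sorted runs lazily, yielding globally sorted lines
    return heapq.merge(*(open(path) for path in runs))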

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such aggregation would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case (a sketch of this placement scheme is given after the list);

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.
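A minimal sketch of the hash-based placement mentioned in the third item above (hypothetical helper names; the real pipeline operates over on-disk batches rather than per-quadruple calls):

# Sketch of modulo-hash placement over n slave machines: each quadruple is sent
# to the machine(s) determined by hashing its subject and (non-rdf:type) object,
# so that statements mentioning the same term end up co-located.
import zlib

RDF_TYPE = "rdf:type"

def machines_for(quad, n):
    s, p, o, _context = quad
    targets = {zlib.crc32(s.encode()) % n}
    if p != RDF_TYPE:  # do not hash on the objects of rdf:type triples
        targets.add(zlib.crc32(o.encode()) % n)
    return targets

# e.g., with 8 machines:
print(machines_for(("ex:Axel", "foaf:isPrimaryTopicOf", "http://polleres.net/", "ex:doc1"), 8))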

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks – particularly the scale and diversity of the datasets – we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would render redundant the approach under test;

– results derived from applying the methods to subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what – the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions ⟨…⟩ there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state that "real life data is typically dirty ⟨thus⟩ it is often necessary to hinge on the semantics of the data".

Some other works – more related to Information Retrieval and Natural Language Processing – focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets – i.e., consistently named datasets – across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology – particularly functional properties, inverse-functional properties and disjointness constraints – which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the aligning step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison using, e.g., string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities by counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL – we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges – esp. scalability and noise – of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system – called ObjectCoRef – for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
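To put the quadratic concern in concrete terms: under naive replacement semantics, an equivalence class of n coreferent identifiers yields on the order of n(n − 1) materialised owl:sameAs pairs, whereas canonicalisation requires only n − 1 links to a chosen canonical identifier; for the largest equivalence class we encountered (8,481 identifiers), this is the difference between

n(n-1) = 8481 \times 8480 \approx 7.2 \times 10^{7} \quad \text{pairs} \qquad \text{versus} \qquad n-1 = 8480 \quad \text{links.}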

9.5. Large-scale distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware similar to ours. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional-properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods within closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties – i.e., those for which a typical value is shared by few instances – are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
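The general LSH scheme described above can be sketched as follows (a toy MinHash-style illustration under simplifying assumptions, not RDFSim's actual index): documents whose term sets have high Jaccard similarity tend to agree on their per-band minimum hash values, so banding the signature groups likely-similar documents into common buckets, which are then verified with the exact Jaccard coefficient.

# Toy MinHash/LSH sketch: similar term-sets very likely collide in at least one
# band bucket; candidate groups are then checked with the exact Jaccard score.
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(terms, num_hashes):
    # i-th hash of a term: md5 over "i|term", interpreted as an integer
    return [min(int(hashlib.md5(f"{i}|{t}".encode()).hexdigest(), 16) for t in terms)
            for i in range(num_hashes)]

def candidate_groups(docs, bands=8, rows=2):
    buckets = defaultdict(set)
    for name, terms in docs.items():
        sig = minhash_signature(terms, bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    return [group for group in buckets.values() if len(group) > 1]

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {  # hypothetical "virtual documents" built from text surrounding each resource
    "ex:galway1": {"galway", "ireland", "university", "research"},
    "ex:galway2": {"galway", "ireland", "university", "institute"},
    "ex:vienna1": {"vienna", "austria", "siemens"},
}
for group in candidate_groups(docs):
    for a, b in combinations(sorted(group), 2):
        print(a, b, jaccard(docs[a], docs[b]))  # the two "galway" docs (Jaccard 0.6) very likely pair up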

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes – e.g., to notify another agent or correct broken links – and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs – particularly substitution – does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper – particularly as applied to large-scale Linked Data corpora – is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf
b2rr          http://bio2rdf.org/bio2rdf_resource
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec/
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data: Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.

[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

Page 14: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 89

according to subject and perform a merge-join by means of a linearscan thereafter the same procedure applies to the latter set sortingand merge-joining on the object position Applying merge-join scanswe produce new owlsameAs statements

Both the originally asserted and newly inferred owlsameAs

relations are similarly written to an on-disk file over which wenow wish to perform the canonicalised symmetrictransitive clo-sure We apply a similar method again leveraging external sortsand merge-joins to perform the computation (herein we sketchand point the interested reader to [31 Section 46]) In the follow-ing we use gt and lt to denote lexical ordering SA as a shortcut forowlsameAs a b c etc to denote members of U [ B such thata lt b lt c and define URIs to be lexically lower than blank nodesThe process is as follows

(i) we only materialise symmetric equality relations thatinvolve a (possibly intermediary) canonical term chosenby a lexical ordering given b SA a we materialise a SA bgiven a SA b SA c we materialise the relations a SA b aSA c and their inverses but do not materialise b SA c orits inverse

(ii) transitivity is supported by iterative merge-join scans

20 One could instead build an on-disk map for equivalence classes and pivoelements and follow a consolidation method similar to the previous section over theunordered data however we would expect such an on-disk index to have a low cachehit-rate given the nature of the data which would lead to many (slow) disk seeks Analternative approach might be to hash-partition the corpus and equality index overdifferent machines however this would require a non-trivial minimum amount omemory on the cluster

ndash in the scan if we find c SA a (sorted by object) and c SA d(sorted naturally) we infer a SA d and drop the non-canonical c SA d (and d SA c)

ndash at the end of the scan newly inferred triples are marked and merge-joined

into the main equality data any triples echoing an earlier inference are ignored dropped non-canonical statements are removed

ndash the process is then iterative in the next scan if we find dSA a and d SA e we infer a SA e and e SA a

ndash inferences will only occur if they involve a statementadded in the previous iteration ensuring that inferencesteps are not re-computed and that the computation willterminate

(iii) the above iterations stop when a fixpoint is reached andnothing new is inferred

(iv) the process of reaching a fixpoint is accelerated using avail-able main-memory to store a cache of partial equalitychains

The above steps follow well-known methods for transtive clo-sure computation modified to support the symmetry ofowlsameAs and to use adaptive canonicalisation in order toavoid quadratic output With respect to the last item we useAlgorithm 1 to derive lsquolsquobatchesrsquorsquo of in-memory equivalencesand when in-memory capacity is achieved we write thesebatches to disk and proceed with on-disk computation this isparticularly useful for computing the small number of longequality chains which would otherwise require sorts andmerge-joins over all of the canonical owlsameAs data currentlyderived and where the number of iterations would otherwise bethe length of the longest chain The result of this process is a setof canonicalised equality relations representing the symmetrictransitive closure

Next we briefly describe the process of canonicalising data withrespect to this on-disk equality closure where we again use exter-nal-sorts and merge-joins First we prune the owlsameAs indexto only maintain relations s1 SA s2 such that s1 gt s2 thus given s1

SA s2 we know that s2 is the canonical identifier and s1 is to berewritten We then sort the data according to the position thatwe wish to rewrite and perform a merge-join over both sets ofdata buffering the canonicalised data to an output file If we wantto rewrite multiple positions of a file of tuples (eg subject and

object) we must rewrite one position sort the results by the sec-ond position and then rewrite the second position20

Finally note that in the derivation of owlsameAs from the con-solidation rules prp-fp prp-ifp cax-maxc2 the overall processmay be iterative For example consider

dblpAxel foafisPrimaryTopicOf lthttppolleresnetgt

axelme foafisPrimaryTopicOf lthttpaxelderiiegt

lthttppolleresnetgt owlsameAs lthttpaxelderiiegt

from which the conclusion that dblpAxel is the same asaxelme holds We see that new owlsameAs relations (either as-serted or derived from the consolidation rules) may in turn lsquolsquoalignrsquorsquoterms in the join position of the consolidation rules leading to newequivalences Thus for deriving the final owlsameAs we require ahigher-level iterative process as follows

(i) initially apply the consolidation rules and append theresults to a file alongside the asserted owl sameAs state-ments found

(ii) apply the initial closure of the owlsameAs data(iii) then iteratively until no new owlsameAs inferences are

found

ndash rewrite the join positions of the on-disk files containing

the data for each consolidation rule according to the cur-rent owlsameAs data

ndash derive new owlsameAs inferences possible through theprevious rewriting for each consolidation rule

ndash re-derive the closure of the owlsameAs data includingthe new inferences

The final closed file of owlsameAs data can then be re-used torewrite the main corpus in two sorts and merge-join scans oversubject and object

52 Distributed approach

The distributed approach follows quite naturally from theprevious discussion As before we assume that the input dataare evenly pre-distributed over the slave machines (in anyarbitrary ordering) where we can then apply the followingprocess

(i) Run scan the distributed corpus (split over the slavemachines) in parallel to extract relevant terminologicalknowledge

(ii) Gather gather terminological data onto the master machineand thereafter bind the terminological patterns of the gen-eralconsolidation rules

(iii) Flood flood the rules for reasoning and the consolidation-relevant patterns to all slave machines

(iv) Run apply reasoning and extract consolidation-relevantstatements from the input and inferred data

(v) Gather gather all consolidation statements onto the mastermachine then in parallel

t

f

Table 8Breakdown of timing of distributed extended consolidation wreasoning where identifies the masterslave tasks that run concurrently

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 4543 1000 2302 1000 1275 1000 813 1000 7404 1000Executing 299 66 303 131 325 255 301 370 2436 329

Aggregate Consolidation Relevant Data 43 09 45 20 47 37 46 57 299 40Closing owlsameAs

244 54 245 106 254 199 244 300 2112 285Miscellaneous 12 03 13 05 24 19 11 13 25 03

Idle (waiting for slaves) 4245 934 1999 869 950 745 512 630 4968 671

SlaveAvg Executing (total) 4489 988 2212 961 1116 875 565 694 5991 809

Extract Terminology 444 98 227 99 118 92 69 84 494 67Extract Consolidation Relevant Data 955 210 483 210 243 191 132 162 1379 186Initial Sort (by subject) 980 216 483 210 237 186 116 142 1338 181Consolidation 2110 464 1019 443 519 407 249 306 2780 375

Avg Idle 55 12 89 39 379 297 236 290 1413 191Waiting for peers 00 00 32 14 71 56 63 77 375 51Waiting for master 55 12 58 25 308 241 173 212 1038 140

90 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

ndash Local compute the closure of the consolidation rules andthe owlsameAs data on the master machine

ndash Run each slave machine sorts its fragment of the maincorpus according to natural order (spoc)

1 At least in terms of pure quantity However we do not give an indication of theuality or importance of those few equivalences we miss with this approximationhich may well be application specific

(vi) Flood send the closed owlsameAs data to the slavemachines once the distributed sort has been completed

(vii) Run each slave machine then rewrites the subjects of theirsegment of the corpus subsequently sorts the rewritten databy object and then rewrites the objects (of non-rdftypetriples) with respect to the closed owlsameAs data

53 Performance evaluation

Applying the above process to our 1118 billion quadruple corpustook 1234 h Extracting the terminological data took 114 h with anaverage idle time of 19 min (277) (one machine took 18 minlonger than the rest due to processing a large ontology containing2 million quadruples [32]) Merging and aggregating the termino-logical data took roughly 1 min Applying the reasoning andextracting the consolidation relevant statements took 234 h withan average idle time of 25 min (18) Aggregating and mergingthe consolidation relevant statements took 299 min Thereafterlocally computing the closure of the consolidation rules and theequality data took 352 h with the computation requiring two itera-tions overall (the minimum possiblemdashthe second iteration did notproduce any new results) Concurrent to the previous step the paral-lel sort of remote data by natural order took 233 h with an averageidle time of 6 min (43) Subsequent parallel consolidation of thedata took 48 h with 10 min (35) average idle timemdashof this 19of the time was spent consolidating the pre-sorted subjects 60of the time was spent sorting the rewritten data by object and21 of the time was spent consolidating the objects of the data

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on the time possible for the task with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards to performance over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: for each doubling of the number of slave machines (moving from 1 slave machine to 8), the total task time is reduced to 0.507, 0.554 and 0.638 of the previous time, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements (this time directly extracted from explicit statements, as opposed to explicit and inferred statements), which took 15.4 min (11% of the time taken including the general reasoning), with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach.21


Table 9
Top ten most frequently occurring blacklisted values.

#   Blacklisted term                                    Occurrences
1   Empty literals                                      584,735
2   <http://null>                                       414,088
3   <http://www.vox.com/gone>                           150,402
4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"          58,338
5   <http://www.facebook.com>                           6,988
6   <http://facebook.com>                               5,462
7   <http://www.google.com>                             2,234
8   <http://www.facebook.com>                           1,254
9   <http://google.com>                                 1,108
10  <http://null.com>                                   542

Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log). (Plot omitted; x-axis: equivalence class size, y-axis: number of classes.)


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus we can say that, when applying only the consolidation rules, additionally considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), of which only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments over older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID" properties). The literal "08445a31a78661b5c746feff39a9db6e4e2cc5cf" is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
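As a concrete illustration of how such a blacklist can be applied while extracting consolidation-relevant statements, the following is a minimal sketch; the property set and blacklist contents are abbreviated examples only, not the full sets used in our experiments.

```python
# Sketch: filter consolidation-relevant statements against a blacklist of values.
# The sets below are illustrative samples (the full blacklist has 41 entries).

BLACKLIST = {
    "",                                              # empty literal
    "http://null",
    "08445a31a78661b5c746feff39a9db6e4e2cc5cf",      # SHA-1 of 'mailto:'
}

# hypothetical sample of (inverse-)functional properties found in the terminology
INVERSE_FUNCTIONAL = {"foaf:mbox_sha1sum", "foaf:homepage"}

def consolidation_relevant(triples):
    """Yield triples usable for consolidation, skipping blacklisted values."""
    for s, p, o in triples:
        if p in INVERSE_FUNCTIONAL and o not in BLACKLIST:
            yield s, p, o

if __name__ == "__main__":
    data = [("ex:u1", "foaf:mbox_sha1sum", "08445a31a78661b5c746feff39a9db6e4e2cc5cf"),
            ("ex:u2", "foaf:mbox_sha1sum", "d41d8cd98f00b204e9800998ecf8427e")]
    print(list(consolidation_relevant(data)))  # only ex:u2's triple survives
```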

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 106.8 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

The final canonicalised, closed and non-symmetric owl:sameAs index (such that for each statement s1 owl:sameAs s2, we have s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× from baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× from baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user (labelled Team Vox) commonly appearing in user FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford (labelled IBRG), where again it is identified in thousands of documents using different blank-nodes but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time we verified 823 pairs (82.3%) as SAME, 145 (14.5%) as TRIVIALLY SAME, 23 (2.3%) as DIFFERENT and 9 (0.9%) as UNCLEAR. This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Table 10
Largest five equivalence classes after extended consolidation.

#  Canonical term (lexically lowest in equivalence class)            Size    Correct?
1  _:bnode37@http://a12iggymom.vox.com/profile/foaf.rdf              33,052  ✓
2  http://bio2rdf.org/dailymed_drugs:1000                            8,481   ✗
3  http://ajft.org/rdf/foaf.rdf#_me                                   8,140   ✓
4  _:bnode4@http://174.129.12.140:8080/tcm/data/association/100      4,395   ✓
5  _:bnode1@http://aaa977.vox.com/profile/foaf.rdf                   1,977   ✓

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log). (Plot omitted; x-axis: number of PLDs in equivalence class, y-axis: number of classes.)

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

#  Canonical identifier                           BL   R
1  <http://dblp.l3s.de/Tim_Berners-Lee>           26   50
2  <genid:danbri>                                 10   138
3  <http://update.status.net/>                    0    0
4  <http://www.ldodds.com/foaf/foaf-a-matic>      0    0
5  <http://update.status.net/user/1#acct>         0    6

Fig. 7. Distribution of the number of PLDs that terms are referenced by, for the raw, baseline-consolidated and reasoning-consolidated data (log/log). (Plot omitted; x-axis: number of PLDs mentioning a blank-node/URI term, y-axis: number of terms.)



Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%).

25 We make these and later results available for review at http://swse.deri.org/entity

The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation.

26 Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf


Again, although there is an increase in re-use of identifiers across PLDs, we note that: (i) the vast majority of identifiers (about 99%) still only appear in one PLD; and (ii) the difference between the baseline and extended reasoning approaches is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities (with higher concurrence scores) are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results: we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values, instead of URIs, to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic, to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if a suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject pairs (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities; a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs: e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers), where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10  foaf:maker              ex:Alice .
dblp:AliceB10  foaf:maker              ex:Bob .
ex:Alice       foaf:gender             "female" .
ex:Alice       foaf:workplaceHomepage  <http://wonderland.com> .
ex:Bob         foaf:gender             "male" .
ex:Bob         foaf:workplaceHomepage  <http://wonderland.com> .
ex:Claire      foaf:gender             "female" .
ex:Claire      foaf:workplaceHomepage  <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality, as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p wrt. s in G, denoted Card_G(p, s), is the cardinality of the set $\{o \in \mathbf{C} \mid (s,p,o) \in G\}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p wrt. o in G, denoted ICard_G(p, o), is the cardinality of the set $\{s \in \mathbf{U} \cup \mathbf{B} \mid (s,p,o) \in G\}$.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph; e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogously for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathrm{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is by definition greater-than or equal-to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.
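For concreteness, the following is a minimal, in-memory sketch (not the sorted-scan implementation described later in Section 6.1.2) of how the observed inverse-cardinalities of Definition 2 and the AIC of Definition 3 can be computed over a small graph; the triples reuse the running example above.

```python
from collections import defaultdict

# Sketch: observed inverse-cardinality (Definition 2) and average
# inverse-cardinality (Definition 3), computed naively in memory.

triples = [
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Claire", "foaf:gender", '"female"'),
]

def icard(graph):
    """Map (p, o) -> number of distinct subjects, i.e. ICard_G(p, o)."""
    subjects = defaultdict(set)
    for s, p, o in graph:
        subjects[(p, o)].add(s)
    return {po: len(ss) for po, ss in subjects.items()}

def aic(graph, prop):
    """AIC_G(p): triples with predicate p divided by its distinct object values."""
    pairs = {(s, o) for s, p, o in graph if p == prop}
    objects = {o for _, o in pairs}
    return len(pairs) / len(objects)

ic = icard(triples)
print(ic[("foaf:gender", '"female"')])   # 2
print(aic(triples, "foaf:gender"))       # 1.5
```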

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values of properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula to weight properties with few observations towards a global mean, as follows:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

$$\mathrm{AAIC}_G(p) = \frac{\mathrm{AIC}_G(p)\cdot|G_p| + \overline{\mathrm{AIC}_G}\cdot\overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1)$$

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathrm{AIC}_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{\mathrm{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that $\mathrm{AIC}_G(p)\cdot|G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathrm{AIC}_G}\cdot\overline{|G|} = \frac{|G|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate.

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]). (Diagram omitted; it depicts two researchers, sw:person/stefan-decker and sw:person/andreas-harth, linked to three co-authored papers via foaf:made, dc:creator, swrc:author and foaf:maker, sharing foaf:member and swrc:affiliation links with their organisation, and both foaf:based_near dbpedia:Ireland.)

We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean ($\overline{\mathrm{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($\overline{|G|}$).
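The credibility adjustment of Eq. (1) can be sketched as follows; the helper below recomputes all statistics from scratch in memory, which is only meant to mirror the definition (under the assumption that the predicate occurs in the graph), not the sorted-scan implementation used over the full corpus.

```python
# Sketch: adjusted average inverse-cardinality, Eq. (1).
# graph: iterable of (s, p, o) triples; all statistics recomputed naively.

def aaic(graph, prop):
    triples = list(graph)
    # per-predicate statistics
    pairs_p   = {(s, o) for s, p, o in triples if p == prop}   # triples with predicate prop
    objects_p = {o for _, o in pairs_p}                        # |G_p|
    aic_p     = len(pairs_p) / len(objects_p)                  # AIC_G(prop)
    # global statistics
    po_pairs  = {(p, o) for _, p, o in triples}                # distinct (p, o) pairs
    preds     = {p for _, p, _ in triples}
    mean_aic  = len(triples) / len(po_pairs)                   # mean AIC over all (p, o) pairs
    mean_size = len(po_pairs) / len(preds)                     # mean distinct objects per predicate
    # credibility-weighted reading: dampen aic_p towards mean_aic
    return (aic_p * len(objects_p) + mean_aic * mean_size) / (len(objects_p) + mean_size)
```

In this sketch, a property observed only a handful of times is pulled towards the corpus-wide mean, as intended by the credibility formula.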

Now we move towards combining these metrics to determine the concurrence of entities that share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company (e.g., http://google.com) should not gain as much concurrence as entities that share a value referential to a smaller company (e.g., http://deri.ie). Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p, s) with respect to a graph G is given as:

$$C_G(p,s) = \frac{1}{\mathrm{Card}_G(p,s)\cdot\mathrm{AAC}_G(p)}$$

and the concurrence-coefficient of a predicate–object pair (p, o) with respect to a graph G is analogously given as:

$$IC_G(p,o) = \frac{1}{\mathrm{ICard}_G(p,o)\cdot\mathrm{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0, 1], since the denominator is, by definition, necessarily greater-than or equal-to one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2,000 times as a value for p_wh, i.e., ICard_G(p_wh, o_g) = 2000; now, IC_G(p_wh, o_g) = 1/(2000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie, such that ICard_G(p_wh, o_d) = 100; now, IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let $Z = (z_1, \ldots, z_n)$ be a tuple such that for each $i = 1, \ldots, n$, $z_i \in\, ]0,1]$. The aggregated concurrence value $ACS_n$ is computed iteratively: starting with $ACS_0 = 0$, then for each $k = 1, \ldots, n$, $ACS_k = z_k + ACS_{k-1} - z_k \cdot ACS_{k-1}$.

The computation of the ACS value follows the same process as determining the probability of two independent events occurring, $P(A \lor B) = P(A) + P(B) - P(A)\cdot P(B)$, which is by definition commutative and associative, and thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
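A minimal sketch of this aggregation follows; the coefficient values are made-up examples in ]0, 1], not figures from our corpus.

```python
from functools import reduce

# Sketch: aggregated concurrence score (Definition 6), i.e. the
# probabilistic/fuzzy disjunctive sum of the shared-pair coefficients.

def acs(coefficients):
    """Fold z + acc - z*acc over the tuple; the order of elements does not matter."""
    return reduce(lambda acc, z: z + acc - z * acc, coefficients, 0.0)

if __name__ == "__main__":
    # e.g. two shared pairs: a workplace homepage and a gender value
    print(acs([0.00143, 0.0005]))   # slightly less than the plain sum
    print(acs([0.0005, 0.00143]))   # same result regardless of order
```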

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate-edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate $p_i$.29 We then aggregate this bag as follows:

$$\mathrm{ACS}(\mathcal{Z}) = \mathrm{ACS}\Big(\big(\mathrm{ACS}\!\big(Z_{p_i}\cdot\mathrm{AA[I]C}(p_i)\big)\,/\,\mathrm{AA[I]C}(p_i)\big)_{Z_{p_i} \in \mathcal{Z}}\Big)$$

where AA[I]C is either AAC or AAIC, depending on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.
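The following sketch illustrates this two-level aggregation under the reading of the equation given above (per-predicate coefficients are scaled up by AA[I]C, aggregated, and rescaled so that each predicate contributes at most the maximum coefficient its AA[I]C permits); the predicate names and statistics are illustrative assumptions only.

```python
from functools import reduce

# Sketch: per-predicate ("intra-property") aggregation of concurrence coefficients,
# followed by the top-level aggregation across predicates.

def acs(values):
    """Probabilistic-sum aggregation (Definition 6)."""
    return reduce(lambda acc, z: z + acc - z * acc, values, 0.0)

def concurrence(shared, aa_ic):
    """shared: predicate -> list of coefficients; aa_ic: predicate -> AA[I]C value."""
    per_predicate = []
    for p, coeffs in shared.items():
        inner = acs([z * aa_ic[p] for z in coeffs])   # bounded by 1, grows with shared values
        per_predicate.append(inner / aa_ic[p])        # cap each predicate's total contribution
    return acs(per_predicate)

if __name__ == "__main__":
    aa_ic = {"foaf:made": 1.4, "swrc:affiliation": 80.0}   # assumed statistics
    shared = {"foaf:made": [0.5, 0.3, 0.2],                # three co-authored papers
              "swrc:affiliation": [0.001]}                 # one shared affiliation
    print(concurrence(shared, aa_ic))
```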

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c) and then scanning the data, computing the cardinality (number of distinct objects) for each (s, p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p, o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (4).

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s, p) pair, for each set of unique objects O_ps found thereon, and for each pair in

$$\{(o_i, o_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid o_i, o_j \in O_{ps},\ o_i < o_j\}$$

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

$$(o_i,\ o_j,\ C(p,s),\ p,\ s,\ -)$$

where $C(p,s) = \frac{1}{|O_{ps}|\cdot\mathrm{AAC}(p)}$. We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects S_po, and for each pair in

$$\{(s_i, s_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid s_i, s_j \in S_{po},\ s_i < s_j\}$$

we output analogous sextuples of the form:

$$(s_i,\ s_j,\ IC(p,o),\ p,\ o,\ +)$$

We call the sets O_ps, and their analogues S_po, concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores.

29 For brevity, we omit the graph subscript.

Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements; generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets; for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate $\frac{(10^6)^2 - 10^6}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., $|S_{po}|$) is a factor in the denominator of the computed C(p, o) score, such that $C(p,o) \propto \frac{1}{|S_{po}|}$. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for the raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
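The pair-generation step, including the class-size threshold, can be sketched as follows; the AAIC lookup, the threshold value and the example data are simplified assumptions standing in for the sorted on-disk processing described above.

```python
from itertools import combinations

# Sketch: emit raw concurrence sextuples for one concurrence class S_po,
# i.e. the set of subjects sharing a given (p, o) pair, applying a class-size cut-off.

MAX_CLASS_SIZE = 38   # chosen so that roughly 1 billion raw tuples are generated

def emit_class(p, o, subjects, aaic):
    """subjects: the concurrence class S_po; aaic: AAIC value assumed for p."""
    if len(subjects) < 2 or len(subjects) > MAX_CLASS_SIZE:
        return            # very large classes contribute little and are pruned
    coeff = 1.0 / (len(subjects) * aaic)
    for s_i, s_j in combinations(sorted(subjects), 2):   # s_i < s_j lexicographically
        yield (s_i, s_j, coeff, p, o, "+")

if __name__ == "__main__":
    cls = {"ex:Alice", "ex:Bob", "ex:Claire"}
    for t in emit_class("foaf:workplaceHomepage", "<http://wonderland.com>", cls, 7.0):
        print(t)
```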

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments;

(ii) coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object;

(iii) run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold;

(iv) gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders;

(v) coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before;

(vi) run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples;

(vii) run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Table 12
Breakdown of timing of distributed concurrence analysis (each cell gives min / %).

Category                                   | 1            | 2            | 4            | 8           | Full-8
Master
  Total execution time                     | 516.2 / 100.0 | 262.7 / 100.0 | 137.8 / 100.0 | 73.0 / 100.0 | 835.4 / 100.0
  Executing                                | 2.2 / 0.4     | 2.3 / 0.9     | 3.1 / 2.3     | 3.4 / 4.6    | 2.1 / 0.3
    Miscellaneous                          | 2.2 / 0.4     | 2.3 / 0.9     | 3.1 / 2.3     | 3.4 / 4.6    | 2.1 / 0.3
  Idle (waiting for slaves)                | 514.0 / 99.6  | 260.4 / 99.1  | 134.6 / 97.7  | 69.6 / 95.4  | 833.3 / 99.7
Slave
  Avg. executing (total, excl. idle)       | 514.0 / 99.6  | 257.8 / 98.1  | 131.1 / 95.1  | 66.4 / 91.0  | 786.6 / 94.2
    Split/sort/scatter (subject)           | 119.6 / 23.2  | 59.3 / 22.6   | 29.9 / 21.7   | 14.9 / 20.4  | 175.9 / 21.1
    Merge-sort (subject)                   | 42.1 / 8.2    | 20.7 / 7.9    | 11.2 / 8.1    | 5.7 / 7.9    | 62.4 / 7.5
    Split/sort/scatter (object)            | 115.4 / 22.4  | 56.5 / 21.5   | 28.8 / 20.9   | 14.3 / 19.6  | 164.0 / 19.6
    Merge-sort (object)                    | 37.2 / 7.2    | 18.7 / 7.1    | 9.6 / 7.0     | 5.1 / 7.0    | 55.6 / 6.6
    Extract high-level statistics          | 20.8 / 4.0    | 10.1 / 3.8    | 5.1 / 3.7     | 2.7 / 3.7    | 28.4 / 3.3
    Generate raw concurrence tuples        | 45.4 / 8.8    | 23.3 / 8.9    | 11.6 / 8.4    | 5.9 / 8.0    | 67.7 / 8.1
    Coordinate/sort concurrence tuples     | 97.6 / 18.9   | 50.1 / 19.1   | 25.0 / 18.1   | 12.5 / 17.2  | 166.0 / 19.9
    Merge-sort/aggregate similarity        | 36.0 / 7.0    | 19.2 / 7.3    | 10.0 / 7.2    | 5.3 / 7.2    | 66.6 / 8.0
  Avg. idle                                | 2.2 / 0.4     | 4.9 / 1.9     | 6.7 / 4.9     | 6.5 / 9.0    | 48.8 / 5.8
    Waiting for peers                      | 0.0 / 0.0     | 2.6 / 1.0     | 3.6 / 2.6     | 3.2 / 4.3    | 46.7 / 5.6
    Waiting for master                     | 2.2 / 0.4     | 2.3 / 0.9     | 3.1 / 2.3     | 3.4 / 4.6    | 2.1 / 0.3


Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values would lead to an uneven distribution of data, and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.
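A minimal sketch of the object-hash coordination step, skipping rdf:type triples, might look as follows; the hash function, machine count and in-memory buckets are illustrative assumptions, whereas the real system streams sorted segments between peers.

```python
import zlib

# Sketch: modulo-hash partitioning of quadruples by object, excluding rdf:type
# triples so that huge class-membership values (e.g. foaf:Person) do not skew one peer.

RDF_TYPE = "rdf:type"
NUM_SLAVES = 8

def target_slave(term, n=NUM_SLAVES):
    """Deterministic peer assignment via a modulo hash on the term."""
    return zlib.crc32(term.encode("utf-8")) % n

def partition_by_object(quads):
    buckets = [[] for _ in range(NUM_SLAVES)]
    for s, p, o, c in quads:
        if p == RDF_TYPE:
            continue                      # rdf:type objects are not object-hashed
        buckets[target_slave(o)].append((s, p, o, c))
    return buckets
```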

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from the data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static; i.e., larger class sizes give quadratically more concurrences, but occur polynomially less often, until the point where the largest classes, which generally only have one occurrence, are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs (a) and predicate–subject pairs (b), respectively, where for each class size on the x-axis ($s_i$) we show the number of classes ($c_i$), the raw non-reflexive, non-symmetric concurrence tuples generated ($sp_i = c_i\,\frac{s_i^2+s_i}{2}$), a cumulative count of tuples generated for increasing class sizes ($csp_i = \sum_{j\le i} sp_j$), and our cut-off ($s_i = 38$) that we have chosen to keep the total number of tuples at 1 billion (log/log). (Plots omitted; x-axis: concurrence class size, y-axis: count.)

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

#  Predicate                   Objects       Triples       AAC
1  foaf:nick                   150,433,035   150,437,864   1.000
2  lldpubmed:journal           6,790,426     6,795,285     1.003
3  rdf:value                   2,802,998     2,803,021     1.005
4  eurostat:geo                2,642,034     2,642,034     1.005
5  eurostat:time               2,641,532     2,641,532     1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

#  Predicate                   Subjects      Triples       AAIC
1  lldpubmed:meshHeading       2,121,384     2,121,384     1.009
2  opiumfield:recommendation   1,920,992     1,920,992     1.010
3  fb:type.object.key          1,108,187     1,108,187     1.017
4  foaf:page                   1,702,704     1,712,970     1.017
5  skipinions:hasFeature       1,010,604     1,010,604     1.019



For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000) we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996): the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13.

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs; intra-PLD concurrences are thus both more numerous and typically stronger.32

Table 15
Top five concurrent entities and the number of pairs they share.

#  Entity label 1   Entity label 2   Concur.
1  New York City    New York State   791
2  London           England          894
3  Tokyo            Japan            900
4  Toronto          Ontario          418
5  Philadelphia     Pennsylvania     217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively: entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

#  Ranked entity       Con.    "Closest" entity    Val.
1  Tim Berners-Lee     908     Lalana Kagal        0.83
2  Dan Brickley        2,552   Libby Miller        0.94
3  update.status.net   11      socialnetwork.ro    0.45
4  FOAF-a-matic        21      foaf.me             0.23
5  Evan Prodromou      3,367   Stav Prodromou      0.89



In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities; for brevity, again we show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are and are not consolidated in the previous consolidation step, we output sextuples of the form

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35
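For illustration (using fictional identifiers and a fictional source document), a quadruple whose subject and object were rewritten to canonical terms during consolidation would be output as the sextuple

(ex:Tim_canon, foaf:knows, ex:Dan_canon, http://example.org/doc.rdf, ex:timbl, ex:danbri)

where ex:Tim_canon and ex:Dan_canon are the canonical identifiers chosen for the respective equivalence classes, and ex:timbl and ex:danbri are the identifiers that originally appeared in the source document.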

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where the first pattern is the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dbo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dbo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dbo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here, we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dbo:length (a functional property), and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but without giving recommendations for datatypes or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:


Terminological [http://xmlns.com/foaf/spec/index.rdf]
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
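The relaxed comparison described above can be sketched as follows (a minimal Python illustration; the Literal structure and the use of structural equality as a stand-in for datatype-aware value equality are simplifying assumptions, not part of the original implementation):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Literal:
        lexical: str                    # lexical form, e.g. "25-05"
        datatype: Optional[str] = None  # datatype IRI, e.g. "xsd:gMonthDay"
        lang: Optional[str] = None      # language tag, e.g. "en"

    def clashes(l1: Literal, l2: Literal) -> bool:
        """Heuristic clash test for two values of a functional datatype-property:
        flag an inconsistency only if (i) the two literals are not equal
        (structural equality approximates OWL data-value equality here) and
        (ii) their lower-cased lexical forms, ignoring language-tags and
        datatypes, are also unequal."""
        if l1 == l2:
            return False
        if l1.lexical.lower() == l2.lexical.lower():
            return False
        return True

    # e.g., clashes(Literal("male"), Literal("Male", lang="en"))  -> False
    #       clashes(Literal("male"), Literal("female"))           -> True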

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type) and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw, unconsolidated corpus.
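As a rough sketch of this filter (in Python; representing a rule grounding as a list of original-identifier pairs at its intra-assertional join positions is an assumption made purely for illustration):

    def caused_by_consolidation(join_bindings):
        """Decide whether a fired inconsistency-detection rule constitutes a
        novel inconsistency, i.e., one that only arises because consolidation
        aligned distinct identifiers.

        join_bindings: one (original_term_1, original_term_2) pair per
        intra-assertional join position of the grounding; if every pair binds
        the same original term, the inconsistency was already present in the
        raw data and is ignored."""
        return any(t1 != t2 for (t1, t2) in join_bindings)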

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation performs a repair of the equivalence class based on knowledge of direct inlinks and outlinks available through a simple merge-join, as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
made owl:propertyDisjointWith maker .

Assertional [fictional]
ex:AZ″ maker ex:entcons′ .
dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.
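One natural reading of this order, restated here in LaTeX (the exact formulation is our own), compares the evidence count first and falls back to the concurrence score only on ties:

\[ (d_1, c_1) > (d_2, c_2) \iff d_1 > d_2 \,\vee\, (d_1 = d_2 \wedge c_1 > c_2), \]

with $d_i \in \mathbb{N}$ the number of independent evidences and $c_i \in \mathbb{R}$ the concurrence score.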

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now, let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that there exist no a, b with {a, b} ∈ C, a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej has already been merged in this iteration), producing Eqn+1 at the iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
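A compact sketch of this greedy consistent clustering follows (Python; the names incompatible and weight are illustrative stand-ins for the set C of incompatible pairs and the aggregated weight Ω, and Python's tuple comparison supplies the lexicographic order on (d, c) pairs):

    def repair(identifiers, incompatible, weight):
        """Greedily rebuild consistent sub-equivalence-classes.

        identifiers:  members of the (inconsistent) equivalence class Eq
        incompatible: set of frozensets {a, b} that must end up apart
        weight:       function mapping two clusters (frozensets) to a (d, c)
                      pair, compared lexicographically (higher = stronger)."""
        clusters = {frozenset([i]) for i in identifiers}
        while True:
            # candidate merges: cluster pairs that would not join incompatible ids
            candidates = []
            cl = list(clusters)
            for i in range(len(cl)):
                for j in range(i + 1, len(cl)):
                    a, b = cl[i], cl[j]
                    if not any(frozenset([x, y]) in incompatible
                               for x in a for y in b):
                        candidates.append((weight(a, b), a, b))
            if not candidates:
                return clusters
            # strongest links first; merge each cluster at most once per round
            candidates.sort(key=lambda t: t[0], reverse=True)
            merged = set()
            for _, a, b in candidates:
                if a in merged or b in merged:
                    continue
                clusters.remove(a)
                clusters.remove(b)
                clusters.add(a | b)
                merged.update({a, b})

The fixpoint is reached when no two remaining clusters can be merged without bringing a pair of incompatible identifiers together.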

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data and batch consolidated segments of data.
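The batching itself amounts to a standard merge-join over two sorted streams; a minimal Python sketch (the tuple layouts follow the description above, everything else is illustrative):

    from itertools import groupby

    def batches(by_subject, by_object):
        """Yield, per canonical identifier, its outlinks and inlinks.

        by_subject: sextuples (s, p, o, c, s', o') sorted by canonical subject s
        by_object:  sextuples (o, p, s, c, o', s') sorted by canonical object o"""
        subj_groups = groupby(by_subject, key=lambda t: t[0])
        obj_groups = groupby(by_object, key=lambda t: t[0])
        s_key, s_grp = next(subj_groups, (None, None))
        o_key, o_grp = next(obj_groups, (None, None))
        while s_key is not None or o_key is not None:
            # smallest canonical identifier still pending on either stream
            key = min(k for k in (s_key, o_key) if k is not None)
            outlinks = list(s_grp) if s_key == key else []
            inlinks = list(o_grp) if o_key == key else []
            yield key, outlinks, inlinks
            if s_key == key:
                s_key, s_grp = next(subj_groups, (None, None))
            if o_key == key:
                o_key, o_grp = next(obj_groups, (None, None))

Each yielded batch can then be buffered and processed independently, as described next.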

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently these segments fit in memory, where for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
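The node-merge operation on this labelled graph might look as follows (a Python sketch; edge labels are sets of triple hashes, so the primary weight of an edge is simply the size of its label set):

    def merge_nodes(edges, a, b):
        """edges: dict mapping frozenset({x, y}) -> set of supporting triple hashes.
        Replace nodes a and b by a single merged node; overlapping edges have
        their hash-sets unioned, so the merged edge's primary weight counts all
        unique evidences, while the edge internal to the merge disappears."""
        merged_node = (a, b)
        out = {}
        for ends, hashes in edges.items():
            if ends == frozenset({a, b}):
                continue  # edge internal to the merge
            new_ends = frozenset(merged_node if n in (a, b) else n for n in ends)
            out.setdefault(new_ends, set()).update(hashes)
        return out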

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair for 1, 2, 4 and 8 slave machines, and for the full corpus on eight machines (Full-8); values are minutes, with percentages of total execution time in parentheses.

Category                                     1              2              4              8              Full-8
Master
  Total execution time                       1147 (100.0)   613 (100.0)    367 (100.0)    238 (100.0)    1411 (100.0)
  Idle (waiting for slaves)                  1147 (100.0)   612 (100.0)    366 (99.9)     238 (99.9)     1411 (100.0)
Slave (avg.)
  Executing (total exc. idle)                1147 (100.0)   590 (96.3)     336 (91.5)     223 (93.7)     1309 (92.8)
    Identify inconsistencies and repairs     747 (65.1)     385 (62.8)     232 (63.4)     172 (72.2)     724 (51.3)
    Repair corpus                            400 (34.8)     205 (33.5)     103 (28.1)     51 (21.5)      585 (41.5)
  Idle                                       0.0 (0.0)      23 (3.7)       31 (8.5)       15 (6.3)       102 (7.2)
    Waiting for peers                        0.0 (0.0)      22 (3.6)       31 (8.4)       15 (6.2)       101 (7.2)

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 23.5 h. The inconsistency extraction and equivalence class repair analysis took 12 h, with a significant average idle time of 75 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair; thus, we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 9.8 h, with an average idle time of 27 min.

In Table 17 we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual     Auto       Count  %
SAME       ⁄          223    44.3
DIFFERENT  ⁄          261    51.9
UNCLEAR    –          19     3.8
⁄          SAME       361    74.6
⁄          DIFFERENT  123    25.4
SAME       SAME       205    91.9
SAME       DIFFERENT  18     8.1
DIFFERENT  DIFFERENT  105    40.2
DIFFERENT  SAME       156    59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

#  Functional property                  Detections
1  foaf:gender                          60
2  foaf:age                             32
3  dbo:height, dbo:length               4
4  dbo:wheelbase, dbo:width             3
5  atomowl:body                         1
6  loc:address, loc:name, loc:phone     1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

#  Disjoint class 1    Disjoint class 2   Detections
1  foaf:Document       foaf:Person        163
2  foaf:Document       foaf:Agent         153
3  foaf:Organization   foaf:Person        17
4  foaf:Person         foaf:Project       3
5  foaf:Organization   foaf:Project       1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log scale; x-axis: number of partitions, y-axis: number of repaired equivalence classes).

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log scale; x-axis: number of IDs in partition, y-axis: number of partitions).

44 We were particularly careful to distinguish information resources, such as WIKIPEDIA articles, and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through

owl:differentFrom assertions. We list the top five functional properties involved in these detections in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%)

had to be completely disbanded since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME) and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded; the results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here, we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
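These figures follow directly from the counts in Table 20; for the pairs re-aligned by the repair, for example:

\[ \mathrm{precision} = \frac{205}{205 + 156} \approx 0.568, \qquad \mathrm{recall} = \frac{205}{205 + 18} \approx 0.919, \qquad F_1 = \frac{2 \cdot 0.568 \cdot 0.919}{0.568 + 0.919} \approx 0.702, \]

and analogously, for the pairs kept separate, precision = 105/(105 + 18) ≈ 0.854 and recall = 105/(105 + 156) ≈ 0.402.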

Interpreting and inspecting the results, we found that the


inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers; although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks,

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.



litmus test when integrated into a system with a critical user-base. Finally, we have only briefly discussed issues relating to

web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions <...> there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty <thus> it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching; for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets)

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets, the interlink process materialises owl:sameAs relations between the two datasets, the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules), and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (especially scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris



to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We

108 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

have shown how to use explicit owlsameAs relations in the datato perform consolidation and subsequently expanded this ap-proach leveraging the declarative formal semantics of the corpusto materialise additional owlsameAs relations We also presenteda scalable approach to identify weighted concurrence relations forentities that share many inlinks outlinks and attribute values wenote that many of those entities demonstrating the highest concur-rence were not coreferent Next we presented an approach usinginconsistencies to disambiguate entities and subsequently repairequivalence classes we found that this approach currently derivesfew diagnoses where the granularity of inconsistencies withinLinked Data is not sufficient for accurately pinpointing all incorrectconsolidation Finally we tempered our contribution with criticaldiscussion particularly focusing on scalability and efficiencyconcerns

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).


[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the Same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Table 8
Breakdown of timing of distributed extended consolidation with reasoning, where * identifies the master/slave tasks that run concurrently. Columns give the number of slave machines (1, 2, 4, 8) over the 100 million quadruple corpus, and Full-8 gives the full corpus over eight slaves; each cell gives minutes / % of total.

Category                                    1            2            4            8            Full-8
Master
  Total execution time                      454.3/100.0  230.2/100.0  127.5/100.0  81.3/100.0   740.4/100.0
  Executing                                 29.9/6.6     30.3/13.1    32.5/25.5    30.1/37.0    243.6/32.9
    Aggregate consolidation-relevant data   4.3/0.9      4.5/2.0      4.7/3.7      4.6/5.7      29.9/4.0
    Closing owl:sameAs*                     24.4/5.4     24.5/10.6    25.4/19.9    24.4/30.0    211.2/28.5
    Miscellaneous                           1.2/0.3      1.3/0.5      2.4/1.9      1.1/1.3      2.5/0.3
  Idle (waiting for slaves)                 424.5/93.4   199.9/86.9   95.0/74.5    51.2/63.0    496.8/67.1
Slave
  Avg. executing (total)                    448.9/98.8   221.2/96.1   111.6/87.5   56.5/69.4    599.1/80.9
    Extract terminology                     44.4/9.8     22.7/9.9     11.8/9.2     6.9/8.4      49.4/6.7
    Extract consolidation-relevant data     95.5/21.0    48.3/21.0    24.3/19.1    13.2/16.2    137.9/18.6
    Initial sort (by subject)*              98.0/21.6    48.3/21.0    23.7/18.6    11.6/14.2    133.8/18.1
    Consolidation                           211.0/46.4   101.9/44.3   51.9/40.7    24.9/30.6    278.0/37.5
  Avg. idle                                 5.5/1.2      8.9/3.9      37.9/29.7    23.6/29.0    141.3/19.1
    Waiting for peers                       0.0/0.0      3.2/1.4      7.1/5.6      6.3/7.7      37.5/5.1
    Waiting for master                      5.5/1.2      5.8/2.5      30.8/24.1    17.3/21.2    103.8/14.0


– Local: compute the closure of the consolidation rules and the owl:sameAs data on the master machine.

– Run: each slave machine sorts its fragment of the main corpus according to natural order (s, p, o, c).

21 At least in terms of pure quantity. However, we do not give an indication of the quality or importance of those few equivalences we miss with this approximation, which may well be application specific.

(vi) Flood: send the closed owl:sameAs data to the slave machines once the distributed sort has been completed.

(vii) Run: each slave machine then rewrites the subjects of their segment of the corpus, subsequently sorts the rewritten data by object, and then rewrites the objects (of non-rdf:type triples) with respect to the closed owl:sameAs data.
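The rewrite in steps (vi) and (vii) can be pictured with a small, illustrative sketch (not the authors' implementation). It assumes the closed owl:sameAs data fits in memory as a map from each identifier to its canonical identifier and that quads are plain (s, p, o, c) tuples; all names below (canon, rewrite_subjects, rewrite_objects) are ours, introduced only for illustration.

```python
# Sketch of the two-pass rewrite on a slave machine (illustrative, not the paper's code).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def rewrite_subjects(quads, canon):
    """First pass: quads arrive sorted by subject; rewrite subject positions."""
    for s, p, o, c in quads:
        yield canon.get(s, s), p, o, c

def rewrite_objects(quads, canon):
    """Second pass: quads re-sorted by object; rewrite objects of non-rdf:type triples."""
    for s, p, o, c in quads:
        if p != RDF_TYPE:
            o = canon.get(o, o)
        yield s, p, o, c

# Toy usage with a hypothetical canonical-identifier map:
canon = {"ex:aliceOld": "ex:alice"}
segment = [("ex:aliceOld", "foaf:knows", "ex:bob", "ex:doc1"),
           ("ex:bob", "foaf:knows", "ex:aliceOld", "ex:doc1")]
pass1 = sorted(rewrite_subjects(segment, canon), key=lambda q: q[2])  # re-sort by object
print(list(rewrite_objects(pass1, canon)))
```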

5.3. Performance evaluation

Applying the above process to our 1.118 billion quadruple corpus took 12.34 h. Extracting the terminological data took 1.14 h, with an average idle time of 19 min (27.7%) (one machine took 18 min longer than the rest due to processing a large ontology containing 2 million quadruples [32]). Merging and aggregating the terminological data took roughly 1 min. Applying the reasoning and extracting the consolidation-relevant statements took 2.34 h, with an average idle time of 25 min (18%). Aggregating and merging the consolidation-relevant statements took 29.9 min. Thereafter, locally computing the closure of the consolidation rules and the equality data took 3.52 h, with the computation requiring two iterations overall (the minimum possible: the second iteration did not produce any new results). Concurrent to the previous step, the parallel sort of remote data by natural order took 2.33 h, with an average idle time of 6 min (4.3%). Subsequent parallel consolidation of the data took 4.8 h, with 10 min (3.5%) average idle time; of this, 19% of the time was spent consolidating the pre-sorted subjects, 60% of the time was spent sorting the rewritten data by object, and 21% of the time was spent consolidating the objects of the data.

As before, Table 8 summarises the timing of the task. Focusing on the performance for the entire corpus, the master machine took 4.06 h to coordinate global knowledge, constituting the lower bound on time possible for the task to execute with respect to increasing machines in our setup; in future, it may be worthwhile to investigate distributed strategies for computing the owl:sameAs closure (which takes 28.5% of the total computation time), but for the moment we mitigate the cost by concurrently running a sort on the slave machines, thus keeping the slaves busy for 63.4% of the time taken for this local aggregation step. The slave machines were on average busy for 80.9% of the total task time; of the idle time, 73.3% was spent waiting for the master machine to aggregate the consolidation-relevant data and to finish the closure of owl:sameAs data, and the balance (26.7%) was spent waiting for peers to finish (mostly during the extraction of terminological data).

With regards performance evaluation over the 100 million quadruple corpus for a varying number of slave machines, perhaps most notably, the average idle time of the slave machines increases quite rapidly as the number of machines increases. In particular, by 4 machines, the slaves can perform the distributed sort-by-subject task faster than the master machine can perform the local owl:sameAs closure. The significant workload of the master machine will eventually become a bottleneck as the number of slave machines increases further. This is also noticeable in terms of overall task time: the respective time savings as the number of slave machines doubles (moving from 1 slave machine to 8) are 0.507, 0.554 and 0.638, respectively.

Briefly, we also ran the consolidation without the general reasoning rules (Table 3) motivated earlier. With respect to performance, the main variations were given by: (i) the extraction of consolidation-relevant statements, this time directly extracted from explicit statements as opposed to explicit and inferred statements, which took 15.4 min (11% of the time taken including the general reasoning) with an average idle time of less than one minute (6% average idle time); (ii) local aggregation of the consolidation-relevant statements, which took 17 min (56.9% of the time taken previously); (iii) local closure of the owl:sameAs data, which took 3.18 h (90.4% of the time taken previously). The total time saved equated to 2.8 h (22.7%), where 33.3 min were saved from coordination on the master machine, and 2.25 h were saved from parallel execution on the slave machines.

5.4. Results evaluation

In this section, we present the results of the consolidation which included the general reasoning step in the extraction of the consolidation-relevant statements. In fact, we found that the only major variation between the two approaches was in the amount of consolidation-relevant statements collected (discussed presently), where the changes in the total amounts of coreferent identifiers, equivalence classes and canonicalised terms were in fact negligible (<0.1%). Thus, for our corpus, extracting only asserted consolidation-relevant statements offered a very close approximation of the extended reasoning approach (see footnote 21).


Table 9
Top ten most frequently occurring blacklisted values.

#   Blacklisted term                                        Occurrences
1   Empty literals                                          584735
2   <http://null>                                           414088
3   <http://www.vox.com/gone>                               150402
4   "08445a31a78661b5c746feff39a9db6e4e2cc5cf"              58338
5   <http://www.facebook.com>                               6988
6   <http://facebook.com>                                   5462
7   <http://www.google.com>                                 2234
8   <http://www.facebook.com>                               1254
9   <http://google.com>                                     1108
10  <http://null.com>                                       542

[Fig. 5. Distribution of the number of identifiers per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log); x-axis: equivalence class size; y-axis: number of classes.]


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of one. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc. (see footnote 22). However, these particular axioms would only affect consolidation for literals; thus, we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties, and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments for older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters, and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal 08445a31a78661b5c746feff39a9db6e4e2cc5cf is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such (see footnote 23).
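To make the role of the blacklist concrete, the following is a minimal sketch (not the authors' code) of how statements carrying blacklisted (inverse-)functional values might be skipped before any owl:sameAs relations are derived from them; the excerpted values are taken from Table 9, and the function name is illustrative.

```python
# Sketch: filter out (inverse-)functional-property memberships with blacklisted values.
BLACKLIST = {
    "",                                          # empty literals
    "http://null",
    "http://www.vox.com/gone",
    "08445a31a78661b5c746feff39a9db6e4e2cc5cf",  # SHA-1 of "mailto:"
}

def filter_blacklisted(memberships):
    """memberships: iterable of (entity, property, value) statements for
    (inverse-)functional properties; drop those whose value is blacklisted."""
    for entity, prop, value in memberships:
        if value.strip() in BLACKLIST:
            continue
        yield entity, prop, value
```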

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 1068 thousand raw owl:sameAs inferences through functional-property reasoning, and 87 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that s1 owl:sameAs s2 implies s1 > s2 and s2 is a canonical identifier) contained 12.03 million statements.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt.
24 This site shut down on 2010/09/30.

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× from baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× from baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× from baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 10.14% from baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user, labelled Team Vox, commonly appearing in user-FOAF exports on the Vox blogging platform (see footnote 24). Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford, labelled IBRG, where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to Result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time we selected: SAME, 823 (82.3%); TRIVIALLY SAME, 145 (14.5%); DIFFERENT, 23 (2.3%); and UNCLEAR, 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

#   Canonical term (lexically lowest in equivalence class)           Size    Correct
1   bnode37@http://a12iggymom.vox.com/profile/foaf.rdf               33052   ✓
2   http://bio2rdf.org/dailymed_drugs:1000                           8481
3   http://ajft.org/rdf/foaf.rdf#_me                                  8140    ✓
4   bnode4@http://174.129.12.140:8080/tcm/data/association/100       4395    ✓
5   bnode1@http://aaa977.vox.com/profile/foaf.rdf                    1977    ✓

[Fig. 6. Distribution of the number of PLDs per equivalence class, for baseline consolidation and extended reasoning consolidation (log/log); x-axis: number of PLDs in equivalence class; y-axis: number of classes.]

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

#   Canonical identifier                               BL    R
1   <http://dblp.l3s.de/Tim_Berners-Lee>               26    50
2   <genid:danbri>                                     10    138
3   <http://update.status.net/>                        0     0
4   <http://www.ldodds.com/foaf/foaf-a-matic>          0     0
5   <http://update.status.net/user/1#acct>             0     6

[Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline-consolidated and reasoning-consolidated data (log/log); x-axis: number of PLDs mentioning a blank-node/URI term; y-axis: number of terms.]


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results (see footnote 25).

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers

25 We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that after reasoning consolidation, Dan Brickley was conflated with a second person (see footnote 26).

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes (see footnote 27); (ii) relatedly, there is only a minor expansion (10.14%) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities, with higher concurrence scores, are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values, instead of URIs, to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate-subject (loosely, inlinks) or predicate-object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities; a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs: e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers); we wish to dampen the cumulative effect of such correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property-value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice
dblp:AliceB10 foaf:maker ex:Bob
ex:Alice foaf:gender "female"
ex:Alice foaf:workplaceHomepage <http://wonderland.com>
ex:Bob foaf:gender "male"
ex:Bob foaf:workplaceHomepage <http://wonderland.com>
ex:Claire foaf:gender "female"
ex:Claire foaf:workplaceHomepage <http://wonderland.com>

where we want to determine the level of (relative) concurrence between three colleagues, ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or, henceforth in this section, simply cardinality) of p with respect to s in G, denoted $\mathit{Card}_G(p,s)$, is the cardinality of the set $\{ o \in \mathbf{C} \mid (s,p,o) \in G \}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or, henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted $\mathit{ICard}_G(p,o)$, is the cardinality of the set $\{ s \in \mathbf{U} \cup \mathbf{B} \mid (s,p,o) \in G \}$.

Thus, loosely, the observed cardinality of a property-subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting $G_{ex}$ denote our example graph, then, e.g., $\mathit{Card}_{G_{ex}}(\text{foaf:maker}, \text{dblp:AliceB10}) = 2$. We see this value as a good indicator of how exclusive (or selective) a given property-subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property-object pair is the number of unique subjects it appears with in the graph; e.g., $\mathit{ICard}_{G_{ex}}(\text{foaf:gender}, \text{"female"}) = 2$. Both directions are considered analogously for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate-subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
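As a concrete illustration of Definitions 1 and 2, the following sketch computes observed cardinalities over the running example graph; the triple encoding and function names are ours, introduced only for illustration.

```python
# Sketch: observed cardinality and inverse-cardinality over an in-memory triple list.
g_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", '"female"'),
    ("ex:Alice", "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Bob", "foaf:gender", '"male"'),
    ("ex:Bob", "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com>"),
]

def card(graph, p, s):
    """Observed cardinality: distinct objects appearing with the pair (p, s)."""
    return len({o for (s2, p2, o) in graph if s2 == s and p2 == p})

def icard(graph, p, o):
    """Observed inverse-cardinality: distinct subjects appearing with (p, o)."""
    return len({s2 for (s2, p2, o2) in graph if p2 == p and o2 == o})

print(card(g_ex, "foaf:maker", "dblp:AliceB10"))   # 2
print(icard(g_ex, "foaf:gender", '"female"'))      # 2
```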

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate-object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written $\mathit{AIC}_G(p)$, is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathit{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s\,(s,p,o) \in G\}|}$$

The average cardinality $\mathit{AC}_G(p)$ of a property p is defined analogously as for $\mathit{AIC}_G(p)$. Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is by definition greater-than or equal-to the denominator. Taking an example, $\mathit{AIC}_{G_{ex}}(\text{foaf:gender}) = 1.5$, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe $\mathit{AIC}_G(p) \approx 1$ a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe $\mathit{AC}_G(p) \approx 1$ as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high $\mathit{AIC}_G(p)$, since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower $\mathit{AIC}_G(p)$, since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values for properties with a low number of observations towards a global mean; consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property-object pair as an indicator for concurrence. Thus, we use a credibility formula, as follows, to weight properties with few observations towards a global mean.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written $\mathit{AAIC}_G(p)$, is then:

$$\mathit{AAIC}_G(p) = \frac{\mathit{AIC}_G(p)\cdot|G_p| + \overline{\mathit{AIC}_G}\cdot\overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1)$$

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathit{AIC}_G}$ is the average inverse-cardinality for all predicate-object pairs (formally, $\overline{\mathit{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s\,(s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s\,(s,p,o) \in G\}|}{|\{p \mid \exists s\,\exists o\,(s,p,o) \in G\}|}$).

Again, the adjusted average cardinality $\mathit{AAC}_G(p)$ of a property p is defined analogously as for $\mathit{AAIC}_G(p)$. Some reduction of Eq. (1) is possible if one considers that $\mathit{AIC}_G(p)\cdot|G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathit{AIC}_G}\cdot\overline{|G|} = \frac{|G|}{|\{p \mid \exists s\,\exists o\,(s,p,o) \in G\}|}$ also denotes the average number of triples per predicate. We maintain Eq.

[Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison (two co-authors with the same affiliation, based in the same country, linked by properties such as foaf:made, dc:creator, swrc:author, foaf:maker, swrc:affiliation, foaf:member and foaf:based_near) are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).]

(1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading ($\mathit{AIC}_G(p)$) is dampened towards a mean ($\overline{\mathit{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($\overline{|G|}$).
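The following sketch illustrates one way Definition 3 and Eq. (1) could be computed in memory for a small graph, showing how a property with few distinct objects is pulled towards the corpus-wide mean; it is not the authors' implementation, and the helper names are ours.

```python
# Sketch: AIC and credibility-weighted AAIC over an in-memory triple list
# (assumes no duplicate triples, so len(graph) is the triple count).
def aic(graph, p):
    """Average inverse-cardinality of p (Definition 3): triples using p as a
    predicate, divided by the distinct objects appearing with p."""
    pairs = [(s, o) for (s, p2, o) in graph if p2 == p]
    return len(pairs) / len({o for (_, o) in pairs})

def aaic(graph, p):
    """Adjusted AIC (Eq. (1)): dampen AIC(p) towards the global mean when p
    has few distinct objects relative to the average predicate."""
    po_pairs = {(p2, o) for (_, p2, o) in graph}          # distinct (p, o) pairs
    preds = {p2 for (_, p2, _) in graph}
    mean_aic = len(graph) / len(po_pairs)                 # global mean AIC
    mean_size = len(po_pairs) / len(preds)                # average |G_p|
    n_p = len({o for (_, p2, o) in graph if p2 == p})     # |G_p|
    return (aic(graph, p) * n_p + mean_aic * mean_size) / (n_p + mean_size)

# For the running example g_ex of Section 6.1.1, aic(g_ex, "foaf:gender") == 1.5,
# while aaic() pulls that reading towards the corpus-wide mean.
```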

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property-value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property-value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company, e.g., http://google.com/, should not gain as much concurrence as entities that share a value referential to a smaller company, e.g., http://deri.ie/. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence (see footnote 28).

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate-subject pair (p, s) with respect to a graph G is given as:

$$\mathit{C}_G(p,s) = \frac{1}{\mathit{Card}_G(p,s)\cdot \mathit{AAC}_G(p)}$$

and the concurrence-coefficient of a predicate-object pair (p, o) with respect to a graph G is analogously given as:

$$\mathit{IC}_G(p,o) = \frac{1}{\mathit{ICard}_G(p,o)\cdot \mathit{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0, 1], since the denominator is, by definition, necessarily greater than or equal to one.

To take an example, let $p_{wh}$ = foaf:workplaceHomepage, and say that we compute $\mathit{AAIC}_G(p_{wh}) = 7$ from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let $o_g$ = http://google.com/ and assume that $o_g$ occurs 2000 times as a value for $p_{wh}$: $\mathit{ICard}_G(p_{wh}, o_g) = 2000$; now, $\mathit{IC}_G(p_{wh}, o_g) = \frac{1}{2000 \cdot 7} \approx 0.00007$. Also, let $o_d$ = http://deri.ie/ such that $\mathit{ICard}_G(p_{wh}, o_d) = 100$; now, $\mathit{IC}_G(p_{wh}, o_d) = \frac{1}{100 \cdot 7} \approx 0.00143$. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property-value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate-object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let $Z = (z_1, \ldots, z_n)$ be a tuple such that for each $i = 1, \ldots, n$, $z_i \in\, ]0,1]$. The aggregated concurrence value $\mathit{ACS}_n$ is computed iteratively: starting with $\mathit{ACS}_0 = 0$, then for each $k = 1, \ldots, n$, $\mathit{ACS}_k = z_k + \mathit{ACS}_{k-1} - z_k \cdot \mathit{ACS}_{k-1}$.

The computation of the ACS value is the same process as determining the probability of two independent events occurring, $P(A \lor B) = P(A) + P(B) - P(A)\cdot P(B)$, which is by definition commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
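A minimal sketch of Definition 6 (not the authors' code) makes the order-independence easy to check: aggregating the two workplace-homepage coefficients from the earlier example in either order yields the same score.

```python
# Sketch: probabilistic-sum ("noisy-or") aggregation of concurrence coefficients.
def acs(coefficients):
    """Aggregated concurrence score (Definition 6) for coefficients in ]0, 1]."""
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

print(acs([0.00143, 0.00007]))   # shared DERI and Google workplace homepages
print(acs([0.00007, 0.00143]))   # identical result: commutative and associative
```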

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property-value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property-value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property-value pair are likely to share further distinct, but related, property-value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value and prune the other edges.

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate-value pairs: instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate $p_i$ (see footnote 29). We then aggregate this bag as follows:

$$\mathit{ACS}(\mathcal{Z}) = \mathit{ACS}\Big(\big\langle \mathit{ACS}\big(Z_{p_i} \cdot \mathit{AA[I]C}(p_i)\big)\big\rangle_{Z_{p_i} \in \mathcal{Z}}\Big)$$

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate-value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set as its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is per-haps more difficult and we leave this as an open question

612 Implementing entity-concurrence analysisWe aim to implement the above methods using sorts and scans

and we wish to avoid any form of complex indexing or pair-wisecomparison First we wish to extract the statistics relating to the(inverse-)cardinalities of the predicates in the data Given that thereare 23 thousand unique predicates found in the input corpus weassume that we can fit the list of predicates and their associatedstatistics in memorymdashif such were not the case one could consideran on-disk map where we would expect a high cache hit-rate basedon the distribution of property occurrences in the data (cf [32])

Moving forward we can calculate the necessary predicate-levelstatistics by first sorting the data according to natural order(spoc) and then scanning the data computing the cardinality(number of distinct objects) for each (sp) pair and maintainingthe average cardinality for each p found For inverse-cardinalityscores we apply the same process sorting instead by (opsc) or-der counting the number of distinct subjects for each (po) pairand maintaining the average inverse-cardinality scores for eachp After each scan the statistics of the properties are adjustedaccording to the credibility formula in Eq (4)

We then apply a second scan of the sorted corpus first we scanthe data sorted in natural order and for each (sp) pair for each setof unique objects Ops found thereon and for each pair in

fethoi ojTHORN 2 U [ B U [ B j oi oj 2 Ops oi lt ojg

where lt denotes lexicographical order we output the following sex-tuple to an on-disk file

ethoi ojCethp sTHORNp sTHORN

where Cethp sTHORN frac14 1jOps jAACethpTHORN We apply the same process for the other

direction for each (po) pair for each set of unique subjects Spo andfor each pair in

fethsi sjTHORN 2 U [ B U [ B j si sj 2 Spo si lt sjg

we output analogous sextuples of the form

ethsi sj ICethp oTHORN p othornTHORN

We call the sets Ops and their analogues Spo concurrence classesdenoting sets of entities that share the given predicatendashsubjectpredicatendashobject pair respectively Here note that the lsquo + rsquo andlsquo rsquo elements simply mark and track the directionality from whichthe tuple was generated required for the final aggregation of the

29 For brevity we omit the graph subscript

co-efficient scores Similarly we do not immediately materialisethe symmetric concurrence scores where we instead do so at theend so as to forego duplication of intermediary processing

Once generated we can sort the two files of tuples by their nat-ural order and perform a merge-join on the first two elementsmdashgeneralising the directional oisi to simply ei each (eiej) pair de-notes two entities that share some predicate-value pairs in com-mon where we can scan the sorted data and aggregate the finalconcurrence measure for each (eiej) pair using the information en-coded in the respective tuples We can thus generate (triviallysorted) tuples of the form (eiejs) where s denotes the final aggre-gated concurrence score computed for the two entities optionallywe can also write the symmetric concurrence tuples (ejeis) whichcan be sorted separately as required

Note that the number of tuples generated is quadratic with re-spect to the size of the respective concurrence class which be-comes a major impediment for scalability given the presence oflarge such setsmdashfor example consider a corpus containing 1 mil-lion persons sharing the value female for the property foaf

gender where we would have to generate 1062106

2 500 billionnon-reflexive non-symmetric concurrence tuples However wecan leverage the fact that such sets can only invoke a minor influ-ence on the final concurrence of their elements given that themagnitude of the setmdasheg jSpojmdashis a factor in the denominator ofthe computed C(po) score such that Cethp oTHORN 1

jSop j Thus in practice

we implement a maximum-size threshold for the Spo and Ops con-currence classes this threshold is selected based on a practicalupper limit for raw similarity tuples to be generated where theappropriate maximum class size can trivially be determined along-side the derivation of the predicate statistics For the purpose ofevaluation we choose to keep the number of raw tuples generatedat around 1 billion and so set the maximum concurrence classsize at 38mdashwe will see more in Section 64

62 Distributed implementation

Given the previous discussion our distributed implementationis fairly straight-forward as follows

(i) Coordinate the slave machines split their segment of thecorpus according to a modulo-hash function on the subjectposition of the data sort the segments and send the splitsegments to the peer determined by the hash-function theslaves simultaneously gather incoming sorted segmentsand subsequently perform a merge-sort of the segments

(ii) Coordinate the slave machines apply the same operationthis time hashing on objectmdashtriples with rdf type as pred-icate are not included in the object-hashing subsequentlythe slaves merge-sort the segments ordered by object

(iii) Run the slave machines then extract predicate-level statis-tics and statistics relating to the concurrence-class-size dis-tribution which are used to decide upon the class sizethreshold

(iv) Gatherfloodrun the master machine gathers and aggre-gates the high-level statistics generated by the slavemachines in the previous step and sends a copy of the globalstatistics back to each machine the slaves subsequentlygenerate the raw concurrence-encoding sextuples asdescribed before from a scan of the data in both orders

(v) Coordinate the slave machines coordinate the locally gen-erated sextuples according to the first element (join posi-tion) as before

(vi) Run the slave machines aggregate the sextuples coordi-nated in the previous step and produce the final non-sym-metric concurrence tuples

Table 12Breakdown of timing of distributed concurrence analysis

Category 1 2 4 8 Full-8

Min Min Min Min Min

MasterTotal execution time 5162 1000 2627 1000 1378 1000 730 1000 8354 1000Executing 22 04 23 09 31 23 34 46 21 03

Miscellaneous 22 04 23 09 31 23 34 46 21 03Idle (waiting for slaves) 5140 996 2604 991 1346 977 696 954 8333 997

SlaveAvg executing (total exc idle) 5140 996 2578 981 1311 951 664 910 7866 942

Splitsortscatter (subject) 1196 232 593 226 299 217 149 204 1759 211Merge-sort (subject) 421 82 207 79 112 81 57 79 624 75

Splitsortscatter (object) 1154 224 565 215 288 209 143 196 1640 196Merge-sort (object) 372 72 187 71 96 70 51 70 556 66Extract high-level statistics 208 40 101 38 51 37 27 37 284 33Generate raw concurrence tuples 454 88 233 89 116 84 59 80 677 81Cooordinatesort concurrence tuples 976 189 501 191 250 181 125 172 1660 199Merge-sortaggregate similarity 360 70 192 73 100 72 53 72 666 80

Avg idle 22 04 49 19 67 49 65 90 488 58Waiting for peers 00 00 26 10 36 26 32 43 467 56Waiting for master 22 04 23 09 31 23 34 46 21 03

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 97

(vii) Run the slave machines produce the symmetric version ofthe concurrence tuples and coordinate and sort on the firstelement

Here we make heavy use of the coordinate function to aligndata according to the join position required for the subsequentprocessing stepmdashin particular aligning the raw data by subjectand object and then the concurrence tuples analogously

Note that we do not hash on the object position of rdf typetriples our raw corpus contains 2068 million such triples and gi-ven the distribution of class memberships we assume that hashingthese values will lead to uneven distribution of data and subse-quently uneven load balancingmdasheg 792 of all class member-ships are for foafPerson hashing on which would send 1637million triples to one machine which alone is greater than theaverage number of triples we would expect per machine (1398million) In any case given that our corpus contains 105 thousandunique values for rdf type we would expect the average-in-verse-cardinality to be approximately 1970mdasheven for classes withtwo members the potential effect on concurrence is negligible

63 Performance evaluation

We apply our concurrence analysis over the consolidated cor-pus derived in Section 5 The total time taken was 139 h Sortingsplitting and scattering the data according to subject on the slavemachines took 306 h with an average idle time of 77 min(42) Subsequently merge-sorting the sorted segments took113 h with an average idle time of 54 min (8) Analogously sort-ing splitting and scattering the non-rdftype statements by ob-ject took 293 h with an average idle time of 118 min (67)Merge sorting the data by object took 099 h with an average idletime of 38 min (63) Extracting the predicate statistics andthreshold information from data sorted in both orders took29 min with an average idle time of 06 min (21) Generatingthe raw unsorted similarity tuples took 698 min with an averageidle time of 21 min (3) Sorting and coordinating the raw similar-ity tuples across the machines took 1801 min with an average idletime of 141 min (78) Aggregating the final similarity took678 min with an average idle time of 12 min (18)

Table 12 presents a breakdown of the timing of the task Firstwith regards application over the full corpus although this task re-quires some aggregation of global-knowledge by the master

machine the volume of data involved is minimal a total of 21minutes is spent on the master machine performing various minortasks (initialisation remote calls logging aggregation and broad-cast of statistics) Thus 997 of the task is performed in parallelon the slave machine Although there is less time spent waitingfor the master machine compared to the previous two tasks deriv-ing the concurrence measures involves three expensive sortcoor-dinatemerge-sort operations to redistribute and sort the data overthe slave swarm The slave machines were idle for on average 58of the total task time most of this idle time (996) was spentwaiting for peers The master machine was idle for almost the en-tire task with 997 waiting for the slave machines to finish theirtasksmdashagain interleaving a job for another task would have practi-cal benefits

Second with regards the timing of tasks when varying the num-ber of slave machines for 100 million quadruples we see that thepercentage of time spent idle on the slave machines increases from04 (1) to 9 (8) However for each incremental doubling of thenumber of slave machines the total overall task times are reducedby 0509 0525 and 0517 respectively

64 Results evaluation

With respect to data distribution after hashing on subject weobserved an average absolute deviation (average distance fromthe mean) of 176 thousand triples across the slave machines rep-resenting an average 013 deviation from the mean near-optimaldata distribution After hashing on the object of non-rdftype tri-ples we observed an average absolute deviation of 129 million tri-ples across the machines representing an average 11 deviationfrom the mean in particular we note that one machine was as-signed 37 million triples above the mean (an additional 33 abovethe mean) Although not optimal the percentage of data deviationgiven by hashing on object is still within the natural variation inrun-times we have seen for the slave machines during most paral-lel tasks

In Fig 9a and b we illustrate the effect of including increasinglylarge concurrence classes on the number of raw concurrence tuplesgenerated For the predicatendashobject pairs we observe a powerndashlaw(-esque) relationship between the size of the concurrence classand the number of such classes observed Second we observe thatthe number of concurrences generated for each increasing classsize initially remains fairly staticmdashie larger class sizes give

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(a) Predicate-object pairs

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output size

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

number of classesconcurrence output size

cumulative concurrence output sizecut-off

1

100

10000

1e+006

1e+008

1e+010

1e+012

1e+014

1e+016

1 10 100 1000 10000 100000 1e+006 1e+007

coun

t

concurrence class size

(b) Predicate-subject pairs

Fig 9 Breakdown of potential concurrence classes given with respect to predicatendashobject pairs and predicatendashsubject pairs respectively where for each class size on the x-axis (si) we show the number of classes (ci) the raw non-reflexive non-symmetric concurrence tuples generated (spi frac14 ci

s2ithornsi

2 ) a cumulative count of tuples generated forincreasing class sizes (cspi frac14

Pj6ispj) and our cut-off (si = 38) that we have chosen to keep the total number of tuples at 1 billion (loglog)

Table 13Top five predicates with respect to lowest adjusted average cardinality (AAC)

Predicate Objects Triples AAC

1 foafnick 150433035 150437864 10002 lldpubmedjournal 6790426 6795285 10033 rdfvalue 2802998 2803021 10054 eurostatgeo 2642034 2642034 10055 eurostattime 2641532 2641532 1005

Table 14Top five predicates with respect to lowest adjusted average inverse-cardinality(AAIC)

Predicate Subjects Triples AAIC

1 lldpubmedmeshHeading 2121384 2121384 10092 opiumfieldrecommendation 1920992 1920992 10103 fbtypeobjectkey 1108187 1108187 10174 foafpage 1702704 1712970 10175 skipinionshasFeature 1010604 1010604 1019

98 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

quadratically more concurrences but occur polynomially less of-tenmdashuntil the point where the largest classes that generally onlyhave one occurrence is reached and the number of concurrencesbegins to increase quadratically Also shown is the cumulativecount of concurrence tuples generated for increasing class sizeswhere we initially see a powerndashlaw(-esque) correlation whichsubsequently begins to flatten as the larger concurrence classes be-come more sparse (although more massive)

For the predicatendashsubject pairs the same roughly holds truealthough we see fewer of the very largest concurrence classesthe largest concurrence class given by a predicatendashsubject pairwas 79 thousand versus 19 million for the largest predicatendashob-ject pair respectively given by the pairs (kwamap macsman-ual_rameau_lcsh) and (opiumfieldrating ) Also weobserve some lsquolsquonoisersquorsquo where for milestone concurrence class sizes(esp at 50 100 1000 2000) we observe an unusual amount ofclasses For example there were 72 thousand concurrence classesof precisely size 1000 (versus 88 concurrence classes at size996)mdashthe 1000 limit was due to a FOAF exporter from the hi5comdomain which seemingly enforces that limit on the total lsquolsquofriendscountrsquorsquo of users translating into many users with precisely 1000values for foafknows30 Also for example there were 55 thousandclasses of size 2000 (versus 6 classes of size 1999)mdashalmost all ofthese were due to an exporter from the bio2rdforg domain whichputs this limit on values for the b2rlinkedToFrom property31 Wealso encountered unusually large numbers of classes approximatingthese milestones such as 73 at 2001 Such phenomena explain thestaggered lsquolsquospikesrsquorsquoand lsquolsquodiscontinuitiesrsquorsquo in Fig 9b which can be ob-served to correlate with such milestone values (in fact similar butless noticeable spikes are also present in Fig 9a)

These figures allow us to choose a threshold of concurrence-class size given an upper bound on raw concurrence tuples togenerate For the purposes of evaluation we choose to keep thenumber of materialised concurrence tuples at around 1 billionwhich limits our maximum concurrence class size to 38 (fromwhich we produce 1024 billion tuples 721 million through shared(po) pairs and 303 million through (ps) pairs)

With respect to the statistics of predicates for the predicatendashsubject pairs each predicate had an average of 25229 unique ob-jects for 37953 total triples giving an average cardinality of15 We give the five predicates observed to have the lowest ad-

30 cf httpapihi5comrestprofilefoaf10061469731 cf httpbio2rdforgmeshD000123Q000235

justed average cardinality in Table 13 These predicates are judgedby the algorithm to be the most selective for identifying their sub-ject with a given object value (ie quasi-inverse-functional) notethat the bottom two predicates will not generate any concurrencescores since they are perfectly unique to a given object (ie thenumber of triples equals the number of objects such that no twoentities can share a value for this predicate) For the predicatendashob-ject pairs there was an average of 11572 subjects for 20532 tri-ples giving an average inverse-cardinality of 264 Weanalogously give the five predicates observed to have the lowestadjusted average inverse cardinality in Table 14 These predicatesare judged to be the most selective for identifying their object witha given subject value (ie quasi-functional) however four of thesepredicates will not generate any concurrences since they are per-fectly unique to a given subject (ie those where the number of tri-ples equals the number of subjects)

Aggregation produced a final total of 6369 million weightedconcurrence pairs with a mean concurrence weight of 00159Of these pairs 195 million involved a pair of identifiers from dif-ferent PLDs (31) whereas 6174 million involved identifiers fromthe same PLD however the average concurrence value for an in-tra-PLD pair was 0446 versus 0002 for inter-PLD pairsmdashalthough

Table 15Top five concurrent entities and the number of pairs they share

Entity label 1 Entity label 2 Concur

1 New York City New York State 7912 London England 8943 Tokyo Japan 9004 Toronto Ontario 4185 Philadelphia Pennsylvania 217

Table 16Breakdown of concurrences for top five ranked entities in SWSE ordered by rankwith respectively entity label number of concurrent entities found the label of theconcurrent entity with the largest degree and finally the degree value

Ranked entity Con lsquolsquoClosestrsquorsquo entity Val

1 Tim Berners-Lee 908 Lalana Kagal 0832 Dan Brickley 2552 Libby Miller 0943 updatestatusnet 11 socialnetworkro 0454 FOAF-a-matic 21 foafme 0235 Evan Prodromou 3367 Stav Prodromou 089

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 99

fewer intra-PLD concurrences are found they typically have higherconcurrences32

In Table 15 we give the labels of top five most concurrent enti-ties including the number of pairs they sharemdashthe concurrencescore for each of these pairs was gt09999999 We note that theyare all locations where particularly on WIKIPEDIA (and thus filteringthrough to DBpedia) properties with location values are typicallyduplicated (eg dbpdeathPlace dbpbirthPlace dbphead-quartersmdashproperties that are quasi-functional) for exampleNew York City and New York State are both the dbpdeath-

Place of dbpediaIsacc_Asimov etcIn Table 16 we give a description of the concurrent entities

found for the top-five ranked entitiesmdashfor brevity again we showentity labels In particular we note that a large amount of concur-rent entities are identified for the highly-ranked persons With re-spect to the strongest concurrences (i) Tim and his former studentLalana share twelve primarily academic links coauthoring six pa-pers (ii) Dan and Libby co-founders of the FOAF project share87 links primarily 73 foafknows relations to and from the samepeople as well as a co-authored paper occupying the same profes-sional positions etc33 (iii) updatestatusnet and socialnet-

workro share a single foafaccountServiceHomepage linkfrom a common user (iv) similarly the FOAF-a-matic and foafme

services share a single mvcbgeneratorAgent inlink (v) finallyEvan and Stav share 69 foafknows inlinks and outlinks exportedfrom the identica service

34 We instead refer the interested reader to [9] for some previous works on the topic

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data and thus this is anapproximative reading for the purposes of illustration we extract the PLDs fromcanonical identifiers which are choosen based on arbitrary lexical ordering

33 Notably Leigh Dodds (creator of the FOAF-a-matic service) is linked by theproperty quaffing drankBeerWith to both

of general inconsistency repair for Linked Data35 We use syntactic shortcuts in our file to denote when s = s0 andor o = o0

Maintaining the additional rewrite information during the consolidation process istrivial where the output of consolidating subjects gives quintuples (spocs0) whichare then sorted and consolidated by o to produce the given sextuples

36 httpmodelsokkamorgENS-core-vocabularycountry_of_residence37 httpontologydesignpatternsorgcpowlfsdas

RDF data representing formal contradictions according to OWLsemantics Where such contradictions are created due to consoli-dation we believe this to be a good indicator of erroneousconsolidation

Once erroneous equivalences have been detected we wouldlike to subsequently diagnose and repair the coreferences involvedOne option would be to completely disband the entire equivalenceclass however there may often be only one problematic identifierthat eg causes inconsistency in a large equivalence classmdashbreak-ing up all equivalences would be a coarse solution Instead hereinwe propose a more fine-grained method for repairing equivalenceclasses that incrementally rebuilds coreferences in the set in amanner that preserves consistency and that is based on the originalevidences for the equivalences found as well as concurrence scores(discussed in the previous section) to indicate how similar the ori-ginal entities are Once the erroneous equivalence classes havebeen repaired we can then revise the consolidated corpus to reflectthe changes

71 High-level approach

The high-level approach is to see if the consolidation of anyentities conducted in Section 5 lead to any novel inconsistenciesand subsequently recant the equivalences involved thus it isimportant to note that our aim is not to repair inconsistenciesin the data but instead to repair incorrect consolidation sympto-mised by inconsistency34 In this section we (i) describe whatforms of inconsistency we detect and how we detect them (ii)characterise how inconsistencies can be caused by consolidationusing examples from our corpus where possible (iii) discuss therepair of equivalence classes that have been determined to causeinconsistency

First in order to track which data are consolidated and whichnot in the previous consolidation step we output sextuples ofthe form

ethsp o c s0 o0THORN

where s p o c denote the consolidated quadruple containingcanonical identifiers in the subjectobject position as appropri-ate and s0 and o0 denote the input identifiers prior toconsolidation35

To detect inconsistencies in the consolidated corpus we use theOWL 2 RLRDF rules with the false consequent [24] as listed inTable 4 A quick check of our corpus revealed that one documentprovides eight owlAsymmetricProperty and 10 owlIrre-

flexiveProperty axioms36 and one directory gives 9 owlAll-

DisjointClasses axioms37 and where we found no other OWL2 axioms relevant to the rules in Table 4

We also consider an additional rule whose semantics are indi-rectly axiomatised by the OWL 2 RLRDF rules (through prp-fp dt-diff and eq-diff1) but which we must support directly since we donot consider consolidation of literals

p a owlFunctionalProperty

x p l1 l2

l1 owldifferentFrom l2

)false

100 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

where we underline the terminological pattern To illustrate wetake an example from our corpus

Terminological [httpdbpediaorgdata3

lengthrdf]

dpolength rdftype owlFunctionalProperty

Assertional [httpdbpediaorgdata

Fiat_Nuova_500xml]dbpediaFiat_Nuova_5000 dpolength3546

^^xsddouble

Assertional [httpdbpediaorgdata

Fiat_500xml]dbpediaFiat_Nuova0 dpolength 297

^^xsddouble

where we use the prime symbol [0] to denote identifiers consideredcoreferent by consolidation Here we see two very closely relatedmodels of cars consolidated in the previous step but we now iden-tify that they have two different values for dpolengthmdasha func-tional-propertymdashand thus the consolidation raises an inconsistency

Note that we do not expect the owldifferentFrom assertionto be materialised but instead intend a rather more relaxed seman-tics based on a heurisitic comparison given two (distinct) literalbindings for l1 and l2 we flag an inconsistency iff (i) the datavalues of the two bindings are not equal (standard OWL semantics)and (ii) their lower-case string value (minus language-tags and data-types) are lexically unequal In particular the relaxation is inspiredby the definition of the FOAF (datatype) functional propertiesfoafage foafgender and foafbirthday where the rangeof these properties is rather loosely defined a generic range ofrdfsLiteral is formally defined for these properties with infor-mal recommendations to use malefemale as gender values andMM-DD syntax for birthdays but not giving recommendations fordatatype or language-tags The latter relaxation means that wewould not flag an inconsistency in the following data

3

co

Terminological [httpxmlnscomfoafspecindexrdf]

9 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e f rn=aacute=ndezSergiohtml0 h t t p w w w i n f o r m a t i k u n i - t r i e r d e l e y d b i n d i c e s a - t r e e a nzuolaSergio_Fern=aacute=ndezhtml1

foafPerson owldisjointWith foafDocument

Assertional [fictional]

exTed foafage 25

exTed foafage lsquolsquo25rsquorsquoexTed foafgender lsquolsquomalersquorsquoexTed foafgender lsquolsquoMalersquorsquoenexTed foafbirthday lsquolsquo25-05rsquorsquo^^xsdgMonthDayexTed foafbirthday lsquolsquo25-05rsquorsquo

With respect to these consistency checking rules we considerthe terminological data to be sound38 We again only consider ter-minological axioms that are authoritatively served by their sourcefor example the following statement

sioc User owldisjointWith foaf Person

would have to be served by a document that either siocUser orfoafPerson dereferences to (either the FOAF or SIOC vocabularysince the axiom applies over a combination of FOAF and SIOC asser-tional data)

Given a grounding for such an inconsistency-detection rule wewish to analyse the constants bound by variables in join positionsto determine whether or not the contradiction is caused by consol-idation we are thus only interested in join variables that appear at

8 In any case we always source terminological data from the raw unconsolidatedrpus

least once in a consolidatable position (thus we do not supportdt-not-type) and where the join variable is lsquolsquointra-assertionalrsquorsquo(exists twice in the assertional patterns) Other forms of inconsis-tency must otherwise be present in the raw data

Further note that owlsameAs patternsmdashparticularly in rule eq-diff1mdashare implicit in the consolidated data eg consider

Assertional [httpwwwwikierorgfoafrdf]

wikierwikier0 owldifferentFrom

eswc2006psergio-fernandezrsquo

where an inconsistency is implicitly given by the owlsameAs rela-tion that holds between the consolidated identifiers wikierwiki-errsquo and eswc2006psergio-fernandezrsquo In this example thereare two Semantic Web researchers respectively named lsquolsquoSergioFernaacutendezrsquorsquo39 and lsquolsquoSergio Fernaacutendez Anzuolarsquorsquo40 who both partici-pated in the ESWC 2006 conference and who were subsequentlyconflated in the lsquolsquoDogFoodrsquorsquo export41 The former Sergio subsequentlyadded a counter-claim in his FOAF file asserting the above owldif-ferentFrom statement

Other inconsistencies do not involve explicit owlsameAs pat-terns a subset of which may require lsquolsquopositiversquorsquo reasoning to be de-tected eg

Terminological [httpxmlnscomfoafspec]

foafPerson owldisjointWith foafOrganization

foafknows rdfsdomain foafPerson

Assertional [httpidenticaw3cfoaf]

identica484040 foafknows identica45563

Assertional [inferred by prp-dom]

identica484040 a foafPerson

Assertional [httpwwwdatasemanticweborg

organizationw3crdf]

semweborgw3c0 a foafOrganization

where the two entities are initially consolidated due to sharing thevalue httpwwww3org for the inverse-functional propertyfoafhomepage the W3C is stated to be a foafOrganization

in one document and is inferred to be a person from its identicaprofile through rule prp-dom finally the W3C is a member of twodisjoint classes forming an inconsistency detectable by rule cax-dw42

Once the inconsistencies caused by consolidation have beenidentified we need to perform a repair of the equivalence class in-volved In order to resolve inconsistencies we make three simpli-fying assumptions

(i) the steps involved in the consolidation can be rederived withknowledge of direct inlinks and outlinks of the consolidatedentity or reasoned knowledge derived from there

(ii) inconsistencies are caused by pairs of consolidatedidentifiers

(iii) we repair individual equivalence classes and do not considerthe case where repairing one such class may indirectlyrepair another

httpwwwdatasemanticweborgdumpsconferenceseswc-2006-completerdf2 Note that this also could be viewed as a counter-example for using inconsisten-es to recant consolidation where arguably the two entities are coreferent from aractical perspective even if lsquolsquoincompatiblersquorsquo from a symbolic perspective

3

Fe4

A4

4

cip

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 101

With respect to the first item our current implementation willbe performing a repair of the equivalence class based on knowl-edge of direct inlinks and outlinks available through a simplemerge-join as used in the previous section this thus precludesrepair of consolidation found through rule cls-maxqc2 whichalso requires knowledge about the class memberships of the out-linked node With respect to the second item we say that incon-sistencies are caused by pairs of identifiersmdashwhat we termincompatible identifiersmdashsuch that we do not consider inconsisten-cies caused with respect to a single identifier (inconsistencies notcaused by consolidation) and do not consider the case where thealignment of more than two identifiers are required to cause asingle inconsistency (not possible in our rules) where such a casewould again lead to a disjunction of repair strategies With re-spect to the third item it is possible to resolve a set of inconsis-tent equivalence classes by repairing one for example considerrules with multiple lsquolsquointra-assertionalrsquorsquo join-variables (prp-irpprp-asyp) that can have explanations involving multiple consoli-dated identifiers as follows

Terminological [fictional]

made owlpropertyDisjointWith maker

Assertional [fictional]

exAZ00 maker exentcons0

dblpAntoine_Zimmermann00 made dblpHoganZUPD150

where both equivalences together constitute an inconsistencyRepairing one equivalence class would repair the inconsistencydetected for both we give no special treatment to such a caseand resolve each equivalence class independently In any casewe find no such incidences in our corpus these inconsistenciesrequire (i) axioms new in OWL 2 (rules prp-irp prp-asyp prp-pdw and prp-adp) (ii) alignment of two consolidated sets ofidentifiers in the subjectobject positions Note that such casescan also occur given the recursive nature of our consolidationwhereby consolidating one set of identifiers may lead to align-ments in the join positions of the consolidation rules in the nextiteration however we did not encounter such recursion duringthe consolidation phase for our data

The high-level approach to repairing inconsistent consolidationis as follows

(i) rederive and build a non-transitive symmetric graph ofequivalences between the identifiers in the equivalenceclass based on the inlinks and outlinks of the consolidatedentity

(ii) discover identifiers that together cause inconsistency andmust be separated generating a new seed equivalence classfor each and breaking the direct links between them

(iii) assign the remaining identifiers into one of the seed equiva-lence classes based on

43 We note the possibility of a dual correspondence between our lsquolsquobottom-uprsquoapproach to repair and the lsquolsquotop-downrsquorsquo minimal hitting set techniques introduced byReiter [55]

(a) minimum distance in the non-transitive equivalenceclass(b) if tied use a concurrence score

Given the simplifying assumptions we can formalise the prob-lem thus we denote the graph of non-transitive equivalences fora given equivalence class as a weighted graph G = (VEx) suchthat V B [ U is the set of vertices E B [ U B [ U is theset of edges and x E N R is a weighting function for theedges Our edge weights are pairs (dc) where d is the numberof sets of input triples in the corpus that allow to directly derivethe given equivalence relation by means of a direct owlsameAsassertion (in either direction) or a shared inverse-functional ob-ject or functional subjectmdashloosely the independent evidences for

the relation given by the input graph excluding transitiveowlsameAs semantics c is the concurrence score derivable be-tween the unconsolidated entities and is used to resolve ties(we would expect many strongly connected equivalence graphswhere eg the entire equivalence class is given by a singleshared value for a given inverse-functional property and we thusrequire the additional granularity of concurrence for repairing thedata in a non-trivial manner) We define a total lexicographicalorder over these pairs

Given an equivalence class Eq U [ B that we perceive to causea novel inconsistencymdashie an inconsistency derivable by the align-ment of incompatible identifiersmdashwe first derive a collection ofsets C = C1 Cn C 2U[B such that each Ci 2 C Ci Eq de-notes an unordered pair of incompatible identifiers

We then apply a straightforward greedy consistent clusteringof the equivalence class loosely following the notion of a mini-mal cutting (see eg [63]) For Eq we create an initial set of sin-gleton sets Eq0 each containing an individual identifier in theequivalence class Now let U(EiEj) denote the aggregated weightof the edge considering the merge of the nodes of Ei and thenodes of Ej in the graph the pair (dc) such that d denotes theunique evidences for equivalence relations between all nodes inEi and all nodes in Ej and such that c denotes the concurrencescore considering the merge of entities in Ei and Ejmdashintuitivelythe same weight as before but applied as if the identifiers in Ei

and Ej were consolidated in the graph We can apply the follow-ing clustering

ndash for each pair of sets EiEj 2 Eqn such that 9= ab 2 Ca 2 Eqib 2Eqj (ie consistently mergeable subsets) identify the weightsof U(EiEj) and order the pairings

ndash in descending order with respect to the above weights mergeEiEj pairsmdashsuch that neither Ei or Ej have already been mergedin this iterationmdashproducing En+1 at iterationrsquos end

ndash iterate over n until fixpoint

Thus we determine the pairs of incompatible identifiers thatmust necessarily be in different repaired equivalence classesdeconstruct the equivalence class and then begin reconstructingthe repaired equivalence class by iteratively merging the moststrongly linked intermediary equivalence classes that will not con-tain incompatible identifers43

72 Implementing disambiguation

The implementation of the above disambiguation process canbe viewed on two levels the macro level which identifies and col-lates the information about individual equivalence classes andtheir respectively consolidated inlinksoutlinks and the micro le-vel which repairs individual equivalence classes

On the macro level the task assumes input data sorted byboth subject (spocs0o0) and object (opsco0s0) again such thatso represent canonical identifiers and s0 o0 represent the originalidentifiers as before Note that we also require the assertedowlsameAs relations encoded likewise Given that all the re-quired information about the equivalence classes (their inlinksoutlinks derivable equivalences and original identifiers) are gath-ered under the canonical identifiers we can apply a straight-for-ward merge-join on s-o over the sorted stream of data and batchconsolidated segments of data

On a micro level we buffer each individual consolidated seg-ment into an in-memory index currently these segments fit in

rsquo

102 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

memory where for the largest equivalence classes we note thatinlinksoutlinks are commonly duplicatedmdashif this were not thecase one could consider using an on-disk index which shouldbe feasible given that only small batches of the corpus are underanalysis at each given time We assume access to the relevantterminological knowledge required for reasoning and the predi-cate-level statistics derived during from the concurrence analysisWe apply scan-reasoning and inconsistency detection over eachbatch and for efficiency skip over batches that are not sympto-mised by incompatible identifiers

For equivalence classes containing incompatible identifiers wefirst determine the full set of such pairs through application ofthe inconsistency detection rules usually each detection givesa single pair where we ignore pairs containing the same identi-fier (ie detections that would equally apply over the unconsol-idated data) We check the pairs for a trivial solution if allidentifiers in the equivalence class appear in some pair we checkwhether the graph formed by the pairs is strongly connected inwhich case the equivalence class must necessarily be completelydisbanded

For non-trivial repairs we extract the explicit owlsameAs rela-tions (which we view as directionless) and reinfer owlsameAs

relations from the consolidation rules encoding the subsequentgraph We label edges in the graph with a set of hashes denotingthe input triples required for their derivation such that the cardi-nality of the hashset corresponds to the primary edge weight Wesubsequently use a priority-queue to order the edge-weights andonly materialise concurrence scores in the case of a tie Nodes inthe equivalence graph are merged by combining unique edgesand merging the hashsets for overlapping edges Using these oper-ations we can apply the aformentioned process to derive the finalrepaired equivalence classes

In the final step we encode the repaired equivalence classesin memory and perform a final scan of the corpus (in naturalsorted order) revising identifiers according to their repairedcanonical term

73 Distributed implementation

Distribution of the task becomes straight-forward assumingthat the slave machines have knowledge of terminological datapredicate-level statistics and already have the consolidationencoding sextuples sorted and coordinated by hash on s and oNote that all of these data are present on the slave machines fromprevious tasks for the concurrence analysis we in fact maintainsextuples during the data preparation phase (although not re-quired by the analysis)

Thus we are left with two steps

Table 17Breakdown of timing of distributed disambiguation and repair

Category 1 2

Min Min

MasterTotal execution time 1147 1000 613 10Executing

Miscellaneous Idle (waiting for slaves) 1147 1000 612 10

SlaveAvg Executing (total exc idle) 1147 1000 590 9

Identify inconsistencies and repairs 747 651 385 6Repair Corpus 400 348 205 3

Avg Idle 23Waiting for peers 00 00 22Waiting for master

ndash Run each slave machine performs the above process on its seg-ment of the corpus applying a merge-join over the data sortedby (spocs0o0) and (opsco0s0) to derive batches of consoli-dated data which are subsequently analysed diagnosed anda repair derived in memory

ndash Gatherrun the master machine gathers all repair informationfrom all slave machines and floods the merged repairs to theslave machines the slave machines subsequently perform thefinal repair of the corpus

74 Performance evaluation

We ran the inconsistency detection and disambiguation overthe corpora produced by the extended consolidation approachFor the full corpus the total time taken was 235 h The inconsis-tency extraction and equivalence class repair analysis took 12 hwith a significant average idle time of 75 min (93) in particularcertain large batches of consolidated data took significant amountsof time to process particularly to reason over The additional ex-pense is due to the relaxation of duplicate detection we cannotconsider duplicates on a triple level but must consider uniquenessbased on the entire sextuple to derive the information required forrepair we must apply many duplicate inferencing steps On theother hand we can skip certain reasoning paths that cannot leadto inconsistency Repairing the corpus took 098 h with an averageidle time of 27 min

In Table 17 we again give a detailed breakdown of the timingsfor the task With regards the full corpus note that the aggregationof the repair information took a negligible amount of time andwhere only a total of one minute is spent on the slave machineMost notably load-balancing is somewhat of an issue causingslave machines to be idle for on average 72 of the total tasktime mostly waiting for peers This percentagemdashand the associatedload-balancing issuesmdashwould likely be aggrevated further by moremachines or a higher scale of data

With regards the timings of tasks when varying the number ofslave machines as the number of slave machines doubles the totalexecution times decrease by factors of 0534 0599 and0649 respectively Unlike for previous tasks where the aggrega-tion of global knowledge on the master machine poses a bottle-neck this time load-balancing for the inconsistency detectionand repair selection task sees overall performance start to convergewhen increasing machines

75 Results evaluation

It seems that our discussion of inconsistency repair has beensomewhat academic from the total of 282 million consolidated

4 8 Full-8

min Min Min

00 367 1000 238 1000 1411 1000 01 01 01 01 00 366 999 238 999 1411 1000

63 336 915 223 937 1309 92828 232 634 172 722 724 51335 103 281 51 215 585 41537 31 85 15 63 102 7237 31 84 15 62 101 72 01 01

Table 20Results of manual repair inspection

Manual Auto Count

SAME frasl 223 443DIFFERENT frasl 261 519UNCLEAR ndash 19 38frasl SAME 361 746frasl DIFFERENT 123 254SAME SAME 205 919SAME DIFFERENT 18 81DIFFERENT DIFFERENT 105 402DIFFERENT SAME 156 597

Table 18Breakdown of inconsistency detections for functional-properties where properties inthe same row gave identical detections

Functional property Detections

1 foafgender 602 foafage 323 dboheight dbolength 44 dbowheelbase dbowidth 35 atomowlbody 16 locaddress locname locphone 1

Table 19Breakdown of inconsistency detections for disjoint-classes

Disjoint Class 1 Disjoint Class 2 Detections

1 foafDocument foafPerson 1632 foafDocument foafAgent 1533 foafOrganization foafPerson 174 foafPerson foafProject 35 foafOrganization foafProject 1

1

10

100

1000

1 10 100

num

ber o

f rep

aire

d eq

uiva

lenc

e cl

asse

s

number of partitions

Fig 10 Distribution of the number of partitions the inconsistent equivalenceclasses are broken into (loglog)

1

10

100

1000

1 10 100 1000

num

ber o

f par

titio

ns

number of IDs in partition

Fig 11 Distribution of the number of identifiers per repaired partition (loglog)

44 We were particularly careful to distinguish information resourcesmdashsuch asWIKIPEDIA articlesmdashand non-information resources as being DIFFERENT

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 103

batches to check we found 280 equivalence classes (001)mdashcon-taining 3985 identifiersmdashcausing novel inconsistency Of these280 inconsistent sets 185 were detected through disjoint-classconstraints 94 were detected through distinct literal values for in-verse-functional properties and one was detected through

owldifferentFrom assertions We list the top five functional-properties given non-distinct literal values in Table 18 and thetop five disjoint classes in Table 19 note that some inconsistentequivalent classes had multiple detections where eg the classfoafPerson is a subclass of foafAgent and thus an identicaldetection is given for each Notably almost all of the detections in-volve FOAF classes or properties (Further we note that betweenthe time of the crawl and the time of writing the FOAF vocabularyhas removed disjointness constraints between the foafDocu-

ment and foafPersonfoafAgent classes)In terms of repairing these 280 equivalence classes 29 (104)

had to be completely disbanded since all identifiers were pairwiseincompatible A total of 905 partitions were created during the re-pair with each of the 280 classes being broken up into an averageof 323 repaired consistent partitions Fig 10 gives the distributionof the number of repaired partitions created for each equivalenceclass 230 classes were broken into the minimum of two partitionsand one class was broken into 96 partitions Fig 11 gives the dis-tribution of the number of identfiers in each repaired partitioneach partition contained an average of 44 equivalent identifierswith 577 identifiers in singleton partitions and one partition con-taining 182 identifiers

Finally although the repaired partitions no longer cause incon-sistency this does not necessarily mean that they are correct Fromthe raw equivalence classes identified to cause inconsistency weapplied the same sampling technique as before we extracted503 identifiers and for each randomly sampled an identifier orig-inally thought to be equivalent We then manually checkedwhether or not we considered the identifiers to refer to the sameentity or not In the (blind) manual evaluation we selected optionUNCLEAR 19 times option SAME 232 times (of which 49 were TRIVIALLY

SAME) and option DIFFERENT 249 times44 We checked our manuallyannotated results against the partitions produced by our repair tosee if they corresponded The results are enumerated in Table 20Of the 481 (clear) manually-inspected pairs 361 (721) remainedequivalent after the repair and 123 (254) were separated Of the223 pairs manually annotated as SAME 205 (919) also remainedequivalent in the repair whereas 18 (181) were deemed differenthere we see that the repair has a 919 recall when reesatablishingequivalence for resources that are the same Of the 261 pairs verifiedas DIFFERENT 105 (402) were also deemed different in the repairwhereas 156 (597) were deemed to be equivalent

Overall for identifying which resources were the same andshould be re-aligned during the repair the precision was 0568with a recall of 0919 and F1-measure of 0702 Conversely theprecision for identifying which resources were different and shouldbe kept separate during the repair the precision was 0854 with arecall of 0402 and F1-measure of 0546

Interpreting and inspecting the results we found that the

104 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

inconsistency-based repair process works well for correctly fixingequivalence classes with one or two lsquolsquobad applesrsquorsquo However forequivalence classes which contain many broken resources wefound that the repair process tends towards re-aligning too manyidentifiers although the output partitions are consistent theyare not correct For example we found that 30 of the pairs whichwere annotated as different but were re-consolidated by the repairwere from the operacom domain where users had specified var-ious nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by ourblack-list) This lead to groups of users (the largest containing205 users) being initially identified as being co-referent througha web of sharing different such values In these groups inconsis-tency was caused by different values for gender (a functional prop-erty) and so they were passed into the repair process Howeverthe repair process simply split the groups into male and femalepartitions which although now consistent were still incorrectThis illustrates a core weakness of the proposed repair processconsistency does not imply correctness

In summary for an inconsistency-based detection and repairprocess to work well over Linked Data vocabulary publisherswould need to provide more sensible axioms which indicate wheninstance data are inconsistent These can then be used to detectmore examples of incorrect equivalence (amongst other use-cases[9]) To help repair such cases in particular the widespread useand adoption of datatype-properties which are both inverse-func-tional and functional (ie one-to-one properties such as isbn)would greatly help to automatically identify not only which re-sources are the same but to identify which resources are differentin a granular manner45 Functional properties (eg gender) or dis-joint classes (eg Person Document) which have wide catchmentsare useful to detect inconsistencies in the first place but are not idealfor performing granular repairs

8 Critical discussion

In this section we provide critical discussion of our approachfollowing the dimensions of the requirements listed at the outset

With respect to scale on a high level our primary means oforganising the bulk of the corpus is external-sorts characterisedby the linearithmic time complexity O(nfrasllog(n)) external-sortsdo not have a critical main-memory requirement and are effi-ciently distributable Our primary means of accessing the data isvia linear scans With respect to the individual tasks

ndash our current baseline consolidation approach relies on an in-memory owlsameAs index however we demonstrate an on-disk variant in the extended consolidation approach

ndash the extended consolidation currently loads terminological datathat is required by all machines into memory if necessary weclaim that an on-disk terminological index would offer goodperformance given the distribution of class and property mem-berships where we believe that a high cache-hit rate would beenjoyed

ndash for the entity concurrence analysis the predicate level statisticsrequired by all machines is small in volumemdashfor the momentwe do not see this as a serious factor in scaling-up

ndash for the inconsistency detection we identify the same potentialissues with respect to terminological data also given largeequivalence classes with a high number of inlinks and outlinks

45 Of course in OWL (2) DL datatype-properties cannot be inverse-functional butLinked Data vocabularies often break this restriction [29]

we would encounter main-memory problems where webelieve that an on-disk index could be applied assuming a rea-sonable upper limit on batch sizes

With respect to efficiency

ndash the on-disk aggregation of owlsameAs data for the extendedconsolidation has proven to be a bottleneckmdashfor efficient pro-cessing at higher levels of scale distribution of this task wouldbe a priority which should be feasible given that again theprimitive operations involved are external sorts and scans withnon-critical in-memory indices to accelerate reaching thefixpoint

ndash although we typically observe terminological data to constitutea small percentage of Linked Data corpora (01 in our corpuscf [3133]) at higher scales aggregating the terminological datafor all machines may become a bottleneck and distributedapproaches to perform such would need to be investigatedsimilarly as we have seen large terminological documentscan cause load-balancing issues46

ndash for the concurrence analysis and inconsistency detection dataare distributed according to a modulo-hash function on the sub-ject and object position where we do not hash on the objects ofrdftype triples although we demonstrated even data distri-bution by this approach for our current corpus this may nothold in the general case

ndash as we have already seen for our corpus and machine count thecomplexity of repairing consolidated batches may become anissue given large equivalence class sizes

ndash there is some notable idle time for our machines where thetotal cost of running the pipeline could be reduced by interleav-ing jobs

With the exception of our manually derived blacklist for valuesof (inverse-)functional-properties the methods presented hereinhave been entirely domain-agnostic and fully automatic

One major open issue is the question of precision and recallGiven the nature of the tasksmdashparticularly the scale and diversityof the datasetsmdashwe believe that deriving an appropriate gold stan-dard is currently infeasible

ndash the scale of the corpus precludes manual or semi-automaticprocesses

ndash any automatic process for deriving the gold standard wouldmake redundant the approach to test

ndash results derived from application of the methods on subsets ofmanually verified data would not be equatable to the resultsderived from the whole corpus

ndash even assuming a manual approach were feasible oftentimesthere is no objective criteria for determining what precisely sig-nifies whatmdashthe publisherrsquos original intent is often ambiguous

Thus we prefer symbolic approaches to consolidation and dis-ambiguation that are predicated on the formal semantics of thedata where we can appeal to the fact that incorrect consolidationis due to erroneous data not an erroneous approach Without a for-mal means of sufficiently evaluating the results we employ statis-tical methods for applications where precision is not a primaryrequirement We also use formal inconsistency as an indicator ofimprecise consolidation although we have shown that this methoddoes not currently yield many detections In general we believethat for the corpora we target such research can only find itrsquos real

6 We reduce terminological statements on a document-by-document basis accord-g to unaligned blank-node positions for example we prune RDF collectionsentified by blank-nodes that do not join with eg an owlunionOf axiom

4

inid

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 105

litmus test when integrated into a system with a critical user-baseFinally we have only briefly discussed issues relating to

web-tolerance eg spamming or conflicting data With respectto such consideration we currently (i) derive and use a blacklistfor common void values (ii) consider authority for terminologicaldata [3133] and (iii) try to detect erroneous consolidationthrough consistency verification One might question an approachwhich trusts all equivalences asserted or derived from the dataAlong these lines we track the original pre-consolidation identifi-ers (in the form of sextuples) which can be used to revert errone-ous consolidation In fact similar considerations can be appliedmore generally to the re-use of identifiers across sources givingspecial consideration to the consolidation of third party data aboutan entity is somewhat fallacious without also considering the thirdparty contribution of data using a consistent identifier In bothcases we track the context of (consolidated) statements whichat least can be used to verify or post-process sources47 Currentlythe corpus we use probably does not exhibit any significant deliber-ate spamming but rather indeliberate noise We leave more maturemeans of handling spamming for future work (as required)

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems [where] the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works – more related to Information Retrieval and Natural Language Processing – focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets – i.e., consistently named datasets – across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.
48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology – particularly functional properties, inverse-functional properties and disjointness constraints – which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies that can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes using pair-wise comparison for, e.g., string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL – we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges – esp. scalability and noise – of resolving coreference in heterogeneous RDF Web data.

On a larger scale and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system – called ObjectCoRef – for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style, string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties – i.e., those for which a typical value is shared by few instances – are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes – e.g., to notify another agent or correct broken links – and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs – particularly substitution – does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper – particularly as applied to large-scale Linked Data corpora – is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper, mapping each abbreviated prefix ("T-Box prefixes" such as atomowl, b2r, b2rr, dbo, dbp, ecs, eurostat, fb, foaf, geonames, kwa, lldpubmed, loc, mo, mvcb, opiumfield, owl, quaffing, rdf, rdfs, skipinions and skos, and "A-Box prefixes" such as avtimbl, dbpedia, dblpperson, eswc2006p, identicau, kingdoms, macs, semweborg, swid, timblfoaf, vperson, wikier and yagor) to its full namespace URI.

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, Factforge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from: <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Table 9
Top ten most frequently occurring blacklisted values.

    Blacklisted term                                    Occurrences
 1  Empty literals                                      584735
 2  <http://null>                                       414088
 3  <http://www.vox.com/gone>                           150402
 4  "08445a31a78661b5c746feff39a9db6e4e2cc5cf"          58338
 5  <http://www.facebook.com>                           6988
 6  <http://facebook.com>                               5462
 7  <http://www.google.com>                             2234
 8  <http://www.facebook.com>                           1254
 9  <http://google.com>                                 1108
10  <http://null.com>                                   542

Fig. 5. Distribution of the number of identifiers per equivalence class for baseline consolidation and extended reasoning consolidation (log/log scale; x-axis: equivalence class size; y-axis: number of classes).


Extracting the terminological data, we found authoritative declarations of 434 functional properties, 57 inverse-functional properties and 109 cardinality restrictions with a value of 1. We discarded some third-party, non-authoritative declarations of inverse-functional or functional properties, many of which were simply "echoing" the authoritative semantics of those properties; e.g., many people copy the FOAF vocabulary definitions into their local documents. We also found some other documents on the bio2rdf.org domain that define a number of popular third-party properties as being functional, including dc:title, dc:identifier, foaf:name, etc.22 However, these particular axioms would only affect consolidation for literals; thus, we can say that, if applying only the consolidation rules, also considering non-authoritative definitions would not affect the owl:sameAs inferences for our corpus. Again, non-authoritative reasoning over general rules for a corpus such as ours quickly becomes infeasible [9].

We (again) gathered 3.77 million owl:sameAs triples, as well as 52.93 million memberships of inverse-functional properties, 11.09 million memberships of functional properties and 2.56 million cardinality-relevant triples. Of these, respectively, 22.14 million (41.8%), 1.17 million (10.6%) and 533 thousand (20.8%) were asserted; however, in the resulting closed owl:sameAs data derived with and without the extra reasoned triples, we detected a variation of less than 12 thousand terms (0.08%), where only 129 were URIs, and where other variations in statistics were less than 0.1% (e.g., there were 67 fewer equivalence classes when the reasoned triples were included).

From previous experiments for older RDF Web data [30], we were aware of certain values for inverse-functional properties and functional properties that are erroneously published by exporters and which cause massive incorrect consolidation. We again blacklist statements featuring such values from our consolidation processing, where we give the top 10 such values encountered for our corpus in Table 9; this blacklist is the result of trial and error, manually inspecting large equivalence classes and the most common values for (inverse-)functional properties. Empty literals are commonly exported (with and without language tags) as values for inverse-functional properties (particularly FOAF "chat-ID properties"). The literal 08445a31a78661b5c746feff39a9db6e4e2cc5cf is the SHA-1 hash of the string 'mailto:', commonly assigned as a foaf:mbox_sha1sum value to users who do not specify their email in some input form. The remaining URIs are mainly user-specified values for foaf:homepage, or values automatically assigned for users that do not specify such.23
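A minimal sketch of how such a blacklist can be applied during consolidation is given below; the function and variable names are ours, and the small blacklist shown is only an excerpt of the values in Table 9.

BLACKLIST = {
    "",                                              # empty literals
    "http://null",
    "08445a31a78661b5c746feff39a9db6e4e2cc5cf",      # SHA-1 of 'mailto:'
    "http://www.facebook.com",
    "http://www.google.com",
}

def filter_consolidation_statements(statements, key_predicates):
    """Skip statements on (inverse-)functional properties whose value is
    blacklisted; all other statements pass through unchanged."""
    for (s, p, o) in statements:
        if p in key_predicates and o.strip() in BLACKLIST:
            continue
        yield (s, p, o)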

During the computation of the owl:sameAs closure, we found zero inferences through cardinality rules, 1,068 thousand raw owl:sameAs inferences through functional-property reasoning, and 8.7 million raw owl:sameAs inferences through inverse-functional-property reasoning. The final canonicalised, closed and non-symmetric owl:sameAs index (such that each statement s1 owl:sameAs s2 has s1 > s2, with s2 a canonical identifier) contained 12.03 million statements.

22 cf. http://bio2rdf.org/ns/bio2rdf#Topic; offline as of 2011/06/20.
23 Our full blacklist contains forty-one such values and can be found at http://aidanhogan.com/swse/blacklist.txt
24 This site shut down on 2010/09/30.
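For intuition, the following in-memory union–find sketch (ours; the paper instead computes the closure with distributed on-disk sorts and scans) derives such an index from raw owl:sameAs pairs, choosing the lexically lowest identifier of each equivalence class as canonical, as in Table 10. Each entry of the resulting map corresponds to one statement of the non-symmetric index.

def build_sameas_index(sameas_pairs):
    """Return {identifier: canonical} for all non-canonical identifiers,
    where the canonical identifier is the lexically lowest in its class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexically lowest identifier as the root (canonical).
            lo, hi = (ra, rb) if ra < rb else (rb, ra)
            parent[hi] = lo

    for s1, s2 in sameas_pairs:
        union(s1, s2)

    return {x: find(x) for x in parent if find(x) != x}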

From this data, we generated 2.82 million equivalence classes (an increase of 1.31× over baseline consolidation) mentioning a total of 14.86 million terms (an increase of 2.58× over baseline; 5.77% of all URIs and blank-nodes), of which 9.03 million were blank-nodes (an increase of 21.73× over baseline; 5.46% of all blank-nodes) and 5.83 million were URIs (an increase of 1.014× over baseline; 6.33% of all URIs). Thus, we see a large expansion in the amount of blank-nodes consolidated, but only minimal expansion in the set of URIs referenced in the equivalence classes. With respect to the canonical identifiers, 641 thousand (22.7%) were blank-nodes and 2.18 million (77.3%) were URIs.

Fig. 5 contrasts the equivalence class sizes for the baseline approach (seen previously in Fig. 3) and for the extended reasoning approach. Overall, there is an observable increase in equivalence class sizes, where we see the average equivalence class size grow to 5.26 entities (1.98× baseline), the largest equivalence class size grow to 33,052 (3.9× baseline), and the percentage of equivalence classes with the minimum size 2 drop to 63.1% (from 74.1% in baseline).

In Table 10, we update the five largest equivalence classes. Result 2 carries over from the baseline consolidation. The rest of the results are largely intra-PLD equivalences, where the entity is described using thousands of blank-nodes with a consistent (inverse-)functional property value attached. Result 1 refers to a meta-user, labelled Team Vox, commonly appearing in user-FOAF exports on the Vox blogging platform.24 Result 3 refers to a person identified using blank-nodes (and once by URI) in thousands of RDF documents resident on the same server. Result 4 refers to the Image Bioinformatics Research Group in the University of Oxford, labelled IBRG, where again it is identified in thousands of documents using different blank-nodes, but a consistent foaf:homepage. Result 5 is similar to result 1, but for a Japanese version of the Vox user.

We performed an analogous manual inspection of one thousand pairs, as per the sampling used previously in Section 4.4 for the baseline consolidation results. This time we selected SAME: 823 (82.3%); TRIVIALLY SAME: 145 (14.5%); DIFFERENT: 23 (2.3%); and UNCLEAR: 9 (0.9%). This gives an observed precision for the sample of 97.7% (vs. 97.2% for baseline). After applying our blacklisting, the extended consolidation results are in fact slightly more precise (on average) than the baseline equivalences; this is due in part

Table 10
Largest five equivalence classes after extended consolidation.

    Canonical term (lexically lowest in equivalence class)       Size     Correct
 1  bnode37@http://a12iggymom.vox.com/profile/foaf.rdf           33,052   ✓
 2  http://bio2rdf.org/dailymed_drugs:1000                       8,481
 3  http://ajft.org/rdf/foaf.rdf#_me                              8,140    ✓
 4  bnode4@http://174.129.12.140:8080/tcm/data/association/100   4,395    ✓
 5  bnode1@http://aaa977.vox.com/profile/foaf.rdf                1,977    ✓

Fig. 6. Distribution of the number of PLDs per equivalence class for baseline consolidation and extended reasoning consolidation (log/log scale; x-axis: number of PLDs in equivalence class; y-axis: number of classes).

Table 11
Number of equivalent identifiers found for the top-five ranked entities in SWSE, with respect to baseline consolidation (BL) and reasoning consolidation (R).

    Canonical identifier                              BL    R
 1  <http://dblp.l3s.de/Tim_Berners-Lee>              26    50
 2  <genid:danbri>                                    10    138
 3  <http://update.status.net/>                       0     0
 4  <http://www.ldodds.com/foaf/foaf-a-matic>         0     0
 5  <http://update.status.net/user/1/acct>            0     6

Fig. 7. Distribution of the number of PLDs that terms are referenced by, for the raw, baseline-consolidated and reasoning-consolidated data (log/log scale; x-axis: number of PLDs mentioning a blank-node/URI term; y-axis: number of terms).


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand, 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which, of course, subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million, 56.4%). The entity with the most diverse identifiers (the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley" – one of the founders and leading contributors of the FOAF project – with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

25 We make these and later results available for review at http://swse.deri.org/entity

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer, respectively, to (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that, after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf

(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity – in terms of unique PLDs – was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score: for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities – with higher concurrence scores – are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34], where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial, small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results: we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values, instead of URIs, to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation; in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods, implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities: a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs – e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers) – where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .

ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com> .

ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com> .

ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com> .

where we want to determine the level of (relative) concurrence between three colleagues ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth, in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p,s), is the cardinality of the set {o ∈ C | (s,p,o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth, in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p,o), is the cardinality of the set {s ∈ U ∪ B | (s,p,o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph; e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
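To make the two definitions concrete, the following is a minimal sketch in Python, assuming the example graph G_ex is encoded as a simple list of (subject, predicate, object) triples with prefixed names as plain strings (the helper names are ours, not part of the method itself):

# Example graph G_ex from above, as (subject, predicate, object) triples.
G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", '"female"'),
    ("ex:Alice", "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Bob", "foaf:gender", '"male"'),
    ("ex:Bob", "foaf:workplaceHomepage", "<http://wonderland.com>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com>"),
]

def card(graph, p, s):
    # Observed cardinality: number of distinct objects for the pair (p, s).
    return len({o for (s_, p_, o) in graph if s_ == s and p_ == p})

def icard(graph, p, o):
    # Observed inverse-cardinality: number of distinct subjects for the pair (p, o).
    return len({s for (s_, p_, o_) in graph if p_ == p and o_ == o})

print(card(G_ex, "foaf:maker", "dblp:AliceB10"))   # 2, as in the text
print(icard(G_ex, "foaf:gender", '"female"'))      # 2, as in the text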

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain-cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

AIC_G(p) = |{(s,o) : (s,p,o) ∈ G}| / |{o : ∃s . (s,p,o) ∈ G}|

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the average (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is, by definition, greater-than or equal-to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or as the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and, analogously, properties for which we observe AC_G(p) ≈ 1 quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p), since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p), since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula to weight properties with few observations towards a global mean, as follows.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

AAIC_G(p) = (AIC_G(p) · |G_p| + \overline{AIC_G} · \overline{|G|}) / (|G_p| + \overline{|G|})    (1)

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), \overline{AIC_G} is the average inverse-cardinality for all predicate–object pairs (formally, \overline{AIC_G} = |G| / |{(p,o) : ∃s . (s,p,o) ∈ G}|), and \overline{|G|} is the average number of distinct objects for all predicates in the graph (formally, \overline{|G|} = |{(p,o) : ∃s . (s,p,o) ∈ G}| / |{p : ∃s . ∃o . (s,p,o) ∈ G}|).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that AIC_G(p) · |G_p| = |{(s,o) : (s,p,o) ∈ G}| also denotes the number of triples for which p appears as a predicate in graph G, and that \overline{AIC_G} · \overline{|G|} = |G| / |{p : ∃s . ∃o . (s,p,o) ∈ G}| also denotes the average number of triples per predicate.

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]). [Figure: swperson:stefan-decker and swperson:andreas-harth linked to shared papers (swpaper:140, swpaper:2221, swpaper:3403) via foaf:made / dc:creator / swrc:author / foaf:maker, to sworg:nui-galway and sworg:deri-nui-galway via swrc:affiliation / foaf:member, and to dbpedia:Ireland via foaf:based_near.]


We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (\overline{AIC_G}) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (\overline{|G|}).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company, e.g., http://google.com, should not gain as much concurrence as entities that share a value referential to a smaller company, e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country and for which we find many observations in our corpus, returning a high AAIC value, and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as:

C_G(p,s) = 1 / (Card_G(p,s) · AAC_G(p))

and the concurrence-coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as:

IC_G(p,o) = 1 / (ICard_G(p,o) · AAIC_G(p))

Again, please note that these coefficients fall into the interval ]0,1], since the denominator is, by definition, necessarily greater-than or equal-to one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2,000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2000; now IC_G(p_wh, o_g) = 1/(2000 · 7) ≈ 0.00007. Also let o_d = http://deri.ie, such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 · 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that, for each i = 1, ..., n, z_i ∈ ]0,1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then, for each k = 1, ..., n, ACS_k = z_k + ACS_{k-1} − z_k · ACS_{k-1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring, P(A ∨ B) = P(A) + P(B) − P(A) · P(B), which is by definition commutative and associative, and thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
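The aggregation of Definition 6 can be computed with a single pass over the coefficients; a small sketch follows, where the coefficient values are invented purely for illustration:

def acs(coefficients):
    # Aggregated concurrence score: probabilistic ("noisy-or") sum of the
    # coefficients, each assumed to lie in ]0, 1].
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

# Order-independent, as for the probability of independent events:
print(acs([0.2, 0.5, 0.1]))   # 0.64
print(acs([0.1, 0.2, 0.5]))   # 0.64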

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.
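A small sketch of this pruning step follows, assuming hypothetical helper data: a map from each shared value to the predicates (with direction) linking both entities under comparison to that value, together with illustrative AA[I]C scores (the predicate names are taken from Fig. 8; the numbers are invented):

# shared value -> list of (predicate, direction, AA[I]C) edges used by *both* entities
shared = {
    "swpaper:140": [("foaf:made", "out", 1.6), ("dc:creator", "in", 2.1), ("swrc:author", "in", 2.4)],
    "sworg:deri-nui-galway": [("swrc:affiliation", "out", 3.2)],
}

def prune_same_value(shared_edges):
    # Keep only the most exclusive (lowest AA[I]C) edge per shared value.
    return {value: min(edges, key=lambda e: e[2]) for value, edges in shared_edges.items()}

print(prune_same_value(shared))
# {'swpaper:140': ('foaf:made', 'out', 1.6), 'sworg:deri-nui-galway': ('swrc:affiliation', 'out', 3.2)}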

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples 𝒵 = {Z_{p_1}, ..., Z_{p_n}}, where each element Z_{p_i} represents the tuple of (non-pruned) coefficients generated for the predicate p_i.29 We then aggregate this bag as follows:

ACS(𝒵) = ACS(⟨ ACS(Z_{p_i} · AA[I]C(p_i)) ⟩_{Z_{p_i} ∈ 𝒵})

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper-bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s,p,o,c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o,p,s,c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
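The following is a minimal sketch of such a statistics-gathering scan, assuming the input is an iterator of quadruples already in natural (s,p,o,c)-sorted order; the quad representation and helper name are ours, and the inverse-cardinality statistics are gathered analogously from the (o,p,s,c)-sorted order, with the credibility adjustment of Eq. (1) applied afterwards:

from itertools import groupby

def predicate_cardinality_stats(sorted_quads):
    # Scan quads sorted by (s, p, o, c): for each (s, p) group, count distinct
    # objects, and maintain per-predicate sums to derive the average cardinality.
    sums, counts = {}, {}
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        distinct_objects = len({o for (_, _, o, _) in group})
        sums[p] = sums.get(p, 0) + distinct_objects
        counts[p] = counts.get(p, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}   # average cardinality per predicate

quads = sorted([
    ("dblp:AliceB10", "foaf:maker", "ex:Alice", "ctx:1"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob", "ctx:1"),
    ("ex:Alice", "foaf:gender", '"female"', "ctx:2"),
    ("ex:Bob", "foaf:gender", '"male"', "ctx:2"),
])
print(predicate_cardinality_stats(quads))   # {'foaf:maker': 2.0, 'foaf:gender': 1.0}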

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s,p) pair, for each set of unique objects O_{p,s} found thereon, and for each pair in

{(o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_{p,s}, o_i < o_j}

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p,s), p, s, −)

where C(p,s) = 1 / (|O_{p,s}| · AAC(p)). We apply the same process for the other direction: for each (p,o) pair, for each set of unique subjects S_{p,o}, and for each pair in

{(s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_{p,o}, s_i < s_j}

we output analogous sextuples of the form:

(s_i, s_j, IC(p,o), p, o, +)

We call the sets O_{p,s} and their analogues S_{p,o} concurrence classes, denoting sets of entities that share the given predicate–subject / predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores.

29 For brevity, we omit the graph subscript.

Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements; generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
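A simplified sketch of this final aggregation step follows, assuming the raw sextuples from both directions have been concatenated and sorted by their first two elements; the coefficient values below are invented, and, for brevity, the sketch omits the same-value pruning and per-predicate sub-aggregation described above:

from itertools import groupby

def aggregate_concurrence(sorted_sextuples):
    # Merge-join on (e_i, e_j): aggregate the coefficients of all sextuples
    # for the same entity pair via the probabilistic sum of Definition 6.
    for (ei, ej), group in groupby(sorted_sextuples, key=lambda t: (t[0], t[1])):
        score = 0.0
        for (_, _, coeff, _pred, _val, _direction) in group:
            score = coeff + score - coeff * score
        yield (ei, ej, score)

raw = sorted([
    ("ex:Alice", "ex:Bob", 0.02, "foaf:workplaceHomepage", "<http://wonderland.com>", "+"),
    ("ex:Alice", "ex:Claire", 0.02, "foaf:workplaceHomepage", "<http://wonderland.com>", "+"),
    ("ex:Alice", "ex:Claire", 0.10, "foaf:gender", '"female"', "+"),
    ("ex:Bob", "ex:Claire", 0.02, "foaf:workplaceHomepage", "<http://wonderland.com>", "+"),
])
for pair in aggregate_concurrence(raw):
    print(pair)   # ex:Alice/ex:Claire score higher than the other pairs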

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment to scalability given the presence of large such sets: for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate ((10^6)^2 − 10^6)/2 ≈ 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set, e.g., |S_{p,o}|, is a factor in the denominator of the computed IC(p,o) score, such that IC(p,o) ≤ 1/|S_{p,o}|. Thus, in practice, we implement a maximum-size threshold for the S_{p,o} and O_{p,s} concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
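The cut-off can be chosen directly from the class-size distribution gathered alongside the predicate statistics; the following sketch shows one way to do so, where class_size_counts maps a class size to the number of classes of that size (the distribution below is illustrative, not the real one):

def choose_cutoff(class_size_counts, max_tuples):
    # Largest class size such that the cumulative number of non-reflexive,
    # non-symmetric pairs, c_i * (s_i^2 - s_i) / 2, stays within max_tuples.
    total, cutoff = 0, 0
    for size in sorted(class_size_counts):
        pairs = class_size_counts[size] * (size * size - size) // 2
        if total + pairs > max_tuples:
            break
        total += pairs
        cutoff = size
    return cutoff, total

distribution = {2: 10_000_000, 3: 3_000_000, 50: 20_000, 200: 500, 1_000_000: 1}
print(choose_cutoff(distribution, max_tuples=1_000_000_000))   # (200, 53450000)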

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments, and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12
Breakdown of timing of distributed concurrence analysis (times in minutes; percentages of total execution time).

Category                                  1              2              4              8              Full-8
                                          min     %      min     %      min     %      min     %      min     %
Master
  Total execution time                    516.2  100.0   262.7  100.0   137.8  100.0   73.0   100.0   835.4  100.0
  Executing                               2.2    0.4     2.3    0.9     3.1    2.3     3.4    4.6     2.1    0.3
    Miscellaneous                         2.2    0.4     2.3    0.9     3.1    2.3     3.4    4.6     2.1    0.3
  Idle (waiting for slaves)               514.0  99.6    260.4  99.1    134.6  97.7    69.6   95.4    833.3  99.7
Slave
  Avg. executing (total exc. idle)        514.0  99.6    257.8  98.1    131.1  95.1    66.4   91.0    786.6  94.2
    Split/sort/scatter (subject)          119.6  23.2    59.3   22.6    29.9   21.7    14.9   20.4    175.9  21.1
    Merge-sort (subject)                  42.1   8.2     20.7   7.9     11.2   8.1     5.7    7.9     62.4   7.5
    Split/sort/scatter (object)           115.4  22.4    56.5   21.5    28.8   20.9    14.3   19.6    164.0  19.6
    Merge-sort (object)                   37.2   7.2     18.7   7.1     9.6    7.0     5.1    7.0     55.6   6.6
    Extract high-level statistics         20.8   4.0     10.1   3.8     5.1    3.7     2.7    3.7     28.4   3.3
    Generate raw concurrence tuples       45.4   8.8     23.3   8.9     11.6   8.4     5.9    8.0     67.7   8.1
    Coordinate/sort concurrence tuples    97.6   18.9    50.1   19.1    25.0   18.1    12.5   17.2    166.0  19.9
    Merge-sort/aggregate similarity       36.0   7.0     19.2   7.3     10.0   7.2     5.3    7.2     66.6   8.0
  Avg. idle                               2.2    0.4     4.9    1.9     6.7    4.9     6.5    9.0     48.8   5.8
    Waiting for peers                     0.0    0.0     2.6    1.0     3.6    2.6     3.2    4.3     46.7   5.6
    Waiting for master                    2.2    0.4     2.3    0.9     3.1    2.3     3.4    4.6     2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step; in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and, given the distribution of class memberships, we assume that hashing these values will lead to an uneven distribution of data, and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.
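A sketch of the corresponding split function used when scattering by object follows; the CRC-based modulo hash and the function names are placeholders for whatever hash function the cluster actually uses, and are not taken from the system itself:

import zlib

RDF_TYPE = "rdf:type"

def target_slave(term, num_slaves):
    # Modulo-hash placement of a term onto one of the slave machines.
    return zlib.crc32(term.encode("utf-8")) % num_slaves

def split_by_object(quads, num_slaves):
    # Assign each quad to a slave by hashing its object; rdf:type triples are
    # withheld from object-hashing to avoid skew (e.g. foaf:Person memberships).
    segments = [[] for _ in range(num_slaves)]
    for (s, p, o, c) in quads:
        if p == RDF_TYPE:
            continue
        segments[target_slave(o, num_slaves)].append((o, p, s, c))
    return segments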

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 min is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% of its time spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static: larger class sizes give quadratically more concurrences, but occur polynomially less often.

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs respectively, where, for each class size on the x-axis (s_i), we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i · (s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log). [Two panels: (a) predicate–object pairs; (b) predicate–subject pairs.]

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

    Predicate              Objects        Triples        AAC
1   foaf:nick              150,433,035    150,437,864    1.000
2   lldpubmed:journal      6,790,426      6,795,285      1.003
3   rdf:value              2,802,998      2,803,021      1.005
4   eurostat:geo           2,642,034      2,642,034      1.005
5   eurostat:time          2,641,532      2,641,532      1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

    Predicate                      Subjects     Triples      AAIC
1   lldpubmed:meshHeading          2,121,384    2,121,384    1.009
2   opiumfield:recommendation      1,920,992    1,920,992    1.010
3   fb:type.object.key             1,108,187    1,108,187    1.017
4   foaf:page                      1,702,704    1,712,970    1.017
5   skipinions:hasFeature          1,010,604    1,010,604    1.019


This holds until the point where the largest classes, which generally have only one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000), we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996); the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs and 303 million through (p,s) pairs).

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13.

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although

Table 15
Top five concurrent entities and the number of pairs they share.

    Entity label 1      Entity label 2    Concur.
1   New York City       New York State    791
2   London              England           894
3   Tokyo               Japan             900
4   Toronto             Ontario           418
5   Philadelphia        Pennsylvania      217

Table 16
Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

    Ranked entity       Con.    "Closest" entity     Val.
1   Tim Berners-Lee     908     Lalana Kagal         0.83
2   Dan Brickley        2552    Libby Miller         0.94
3   updatestatus.net    11      socialnetwork.ro     0.45
4   FOAF-a-matic        21      foaf.me              0.23
5   Evan Prodromou      3367    Stav Prodromou       0.89


fewer inter-PLD concurrences are found, they typically have higher concurrence values.32

In Table 15 we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16 we give a description of the concurrent entities found for the top-five ranked entities; for brevity, we again show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading: for the purposes of illustration, we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s,p,o,c,s′), which are then sorted and consolidated by o to produce the given sextuples.
36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes, which incrementally rebuilds coreferences in the set in a manner that preserves consistency, and which is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section), to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .
?x ?p ?l1 , ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here, we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for ?l1 and ?l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:


Terminological [http://xmlns.com/foaf/spec/index.rdf]


foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]

ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
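A sketch of the relaxed comparison for values of a functional (datatype) property follows, where a literal is represented as a (lexical value, language tag, datatype) triple; flag_conflict is a hypothetical helper of ours, and, for simplicity, OWL value equality is approximated here by plain tuple equality rather than full datatype semantics:

def flag_conflict(lit1, lit2):
    # Flag an inconsistency for two values of a functional (datatype) property
    # iff the values differ and their lower-cased lexical forms (ignoring
    # language tag and datatype) also differ.
    (lex1, _lang1, _dt1), (lex2, _lang2, _dt2) = lit1, lit2
    if lit1 == lit2:
        return False                      # identical literals: no conflict
    return lex1.lower() != lex2.lower()   # relaxed lexical comparison

# ex:Ted's foaf:gender values: no conflict is flagged under the relaxation.
print(flag_conflict(("male", None, None), ("Male", "en", None)))                    # False
# The two dpo:length values from the Fiat example above do conflict.
print(flag_conflict(("3.546", None, "xsd:double"), ("2.97", None, "xsd:double")))   # True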

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions, to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
:made owl:propertyDisjointWith :maker .

Assertional [fictional]
ex:AZ″ :maker ex:entcons′ .
dblp:Antoine_Zimmermann″ :made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
    (a) minimum distance in the non-transitive equivalence class;
    (b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency, i.e., an inconsistency derivable by the alignment of incompatible identifiers, we first derive a collection of sets C = {C_1, ..., C_n}, C ⊆ 2^{U∪B}, such that each C_i ∈ C, C_i ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq^0, each containing an individual identifier in the equivalence class. Now let Ω(E_i, E_j) denote the aggregated weight of the edge considering the merge of the nodes of E_i and the nodes of E_j in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in E_i and all nodes in E_j, and such that c denotes the concurrence score considering the merge of the entities in E_i and E_j; intuitively, the same weight as before, but applied as if the identifiers in E_i and E_j were consolidated in the graph. We can apply the following clustering:

– for each pair of sets E_i, E_j ∈ Eq^n such that there is no pair {a, b} ∈ C with a ∈ E_i, b ∈ E_j (i.e., consistently mergeable subsets), identify the weights Ω(E_i, E_j) and order the pairings;

– in descending order with respect to the above weights, merge E_i, E_j pairs (such that neither E_i nor E_j have already been merged in this iteration), producing Eq^{n+1} at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence classes by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
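The following is a sketch of this greedy consistent clustering under the stated assumptions; the edge weights are given as a map from identifier pairs to (d, c) values, the cross-cluster concurrence is approximated here by summation, and, for simplicity, one pair is merged per iteration rather than all non-conflicting pairs per round:

def repair(identifiers, edge_weights, incompatible):
    # Greedy consistent clustering: start from singletons and repeatedly merge
    # the most strongly linked pair of clusters containing no incompatible pair.
    clusters = [{i} for i in identifiers]

    def weight(c1, c2):
        # Aggregate (evidence count, concurrence) over all cross-cluster edges.
        d = sum(edge_weights.get(frozenset((a, b)), (0, 0.0))[0] for a in c1 for b in c2)
        c = sum(edge_weights.get(frozenset((a, b)), (0, 0.0))[1] for a in c1 for b in c2)
        return (d, c)

    def mergeable(c1, c2):
        return not any(frozenset((a, b)) in incompatible for a in c1 for b in c2)

    while True:
        candidates = [(weight(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters)
                      if i < j and mergeable(c1, c2) and weight(c1, c2) > (0, 0.0)]
        if not candidates:
            return clusters
        _, i, j = max(candidates)
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]

ids = ["ex:a", "ex:b", "ex:c"]
weights = {frozenset(("ex:a", "ex:b")): (2, 0.5), frozenset(("ex:b", "ex:c")): (1, 0.1)}
print(repair(ids, weights, incompatible={frozenset(("ex:a", "ex:c"))}))
# e.g. [{'ex:a', 'ex:b'}, {'ex:c'}]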

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s,p,o,c,s′,o′) and object (o,p,s,c,o′,s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in



memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
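A small sketch of this edge bookkeeping follows, where each rederived equivalence edge carries the set of hashes of the input triples justifying it, so that the primary weight of a (possibly merged) edge is simply the cardinality of the union of those sets; the helper names are ours:

from collections import defaultdict

# edge -> set of hashes of the input triples from which the equivalence was derived
edge_evidence = defaultdict(set)

def add_evidence(id1, id2, input_triple):
    edge_evidence[frozenset((id1, id2))].add(hash(input_triple))

def merge_edges(e1, e2):
    # Merging two nodes merges overlapping edges by unioning their hash sets.
    return edge_evidence[e1] | edge_evidence[e2]

add_evidence("ex:a", "ex:b", ("ex:a", "owl:sameAs", "ex:b"))
add_evidence("ex:a", "ex:b", ("ex:a", "foaf:homepage", "<http://example.org/>"))
print(len(edge_evidence[frozenset(("ex:a", "ex:b"))]))   # 2: the primary edge weight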

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair (times in minutes; percentages of total execution time; "–" denotes values not recoverable from the source).

Category                                  1              2              4              8              Full-8
                                          min     %      min     %      min     %      min     %      min     %
Master
  Total execution time                    114.7  100.0   61.3   100.0   36.7   100.0   23.8   100.0   141.1  100.0
  Executing (miscellaneous)               –      –       –      –       0.1    0.1     0.1    0.1     0.0    –
  Idle (waiting for slaves)               114.7  100.0   61.2   –       36.6   99.9    23.8   99.9    141.1  100.0
Slave
  Avg. executing (total exc. idle)        114.7  100.0   59.0   96.3    33.6   91.5    22.3   93.7    130.9  92.8
    Identify inconsistencies and repairs  74.7   65.1    38.5   62.8    23.2   63.4    17.2   72.2    72.4   51.3
    Repair corpus                         40.0   34.8    20.5   33.5    10.3   28.1    5.1    21.5    58.5   41.5
  Avg. idle                               –      –       2.3    3.7     3.1    8.5     1.5    6.3     10.2   7.2
    Waiting for peers                     0.0    0.0     2.2    3.7     3.1    8.4     1.5    6.2     10.1   7.2
    Waiting for master                    –      –       –      –       –      –       –      –       0.1    0.1

ndash Run each slave machine performs the above process on its seg-ment of the corpus applying a merge-join over the data sortedby (spocs0o0) and (opsco0s0) to derive batches of consoli-dated data which are subsequently analysed diagnosed anda repair derived in memory

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus (a minimal sketch of this coordination is given below).
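The following sketch of the gather/flood coordination assumes a simple, hypothetical slave interface (gather_repairs/apply_repairs); the actual system uses the authors' own distribution framework rather than this API.

```python
def coordinate_repair(slaves):
    """Gather/run step (sketch): collect the repair maps computed locally by each
    slave, merge them on the master, and flood the merged repairs back so that
    every slave can rewrite its own segment of the corpus."""
    merged = {}
    for slave in slaves:
        merged.update(slave.gather_repairs())   # local original-id -> repaired-canonical maps
    for slave in slaves:
        slave.apply_repairs(merged)             # each slave performs the final scan/rewrite
    return merged
```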

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage—and the associated load-balancing issues—would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks—where the aggregation of global knowledge on the master machine poses a bottleneck—this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
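Using the total execution times from Table 17 (114.7, 61.3, 36.7 and 23.8 min for one, two, four and eight slaves, respectively), the quoted factors can be reproduced directly:

61.3/114.7 ≈ 0.534,   36.7/61.3 ≈ 0.599,   23.8/36.7 ≈ 0.649.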

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection ("*" denotes any value).

Manual       Auto        Count      %
SAME         *             223    44.3
DIFFERENT    *             261    51.9
UNCLEAR      –              19     3.8
*            SAME          361    74.6
*            DIFFERENT     123    25.4
SAME         SAME          205    91.9
SAME         DIFFERENT      18     8.1
DIFFERENT    DIFFERENT     105    40.2
DIFFERENT    SAME          156    59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

   Functional property                  Detections
1  foaf:gender                          60
2  foaf:age                             32
3  dbo:height, dbo:length                4
4  dbo:wheelbase, dbo:width              3
5  atomowl:body                          1
6  loc:address, loc:name, loc:phone      1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

   Disjoint class 1     Disjoint class 2    Detections
1  foaf:Document        foaf:Person         163
2  foaf:Document        foaf:Agent          153
3  foaf:Organization    foaf:Person          17
4  foaf:Person          foaf:Project          3
5  foaf:Organization    foaf:Project          1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log scale; x-axis: number of partitions, y-axis: number of repaired equivalence classes).

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log scale; x-axis: number of IDs in partition, y-axis: number of partitions).

44 We were particularly careful to distinguish information resources—such as Wikipedia articles—and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%)—containing 3,985 identifiers—causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional-properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 223 times (of which 49 were TRIVIALLY SAME) and option DIFFERENT 261 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 484 (clear) manually-inspected pairs, 361 (74.6%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
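These figures follow directly from the counts in Table 20: of the 361 pairs left equivalent by the repair, 205 were manually judged SAME, and of the 123 pairs separated, 105 were manually judged DIFFERENT:

precision(same) = 205/361 ≈ 0.568,  recall(same) = 205/223 ≈ 0.919,  F1 = 2PR/(P + R) ≈ 0.702;
precision(different) = 105/123 ≈ 0.854,  recall(different) = 105/261 ≈ 0.402,  F1 ≈ 0.546.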

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different but were re-consolidated by the repair were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person/Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external-sorts, characterised by the linearithmic time complexity O(n·log(n)); external-sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines is small in volume—for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck—for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks—particularly the scale and diversity of the datasets—we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criteria for determining what precisely signifies what—the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance: e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works—more related to Information Retrieval and Natural Language Processing—focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets—i.e., consistently named datasets—across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.
48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology—particularly functional properties, inverse-functional properties and disjointness constraints—which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets, the interlink process materialises owl:sameAs relations between the two datasets, the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules), and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis: they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL—we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges—esp. scalability and noise—of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system—called ObjectCoRef—for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional-properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset, or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties—i.e., those for which a typical value is shared by few instances—are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system—called RDFSim—to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
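As a generic illustration of this LSH idea (a MinHash-style sketch under our own assumptions, not RDFSim's actual implementation): resources whose token sets have high Jaccard overlap tend to share at least one hash bucket, and only bucket-mates need be compared further.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: for each seeded hash function, keep the minimum hash over
    the resource's tokens; similar token sets tend to give similar signatures."""
    tokens = tokens or {"<empty>"}
    return tuple(
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(num_hashes)
    )

def lsh_candidates(virtual_docs, bands=8, rows=4):
    """Group resources whose signatures agree on at least one band of `rows`
    consecutive hash values; only resources sharing a bucket are compared further."""
    buckets = defaultdict(set)
    for resource, tokens in virtual_docs.items():
        sig = minhash_signature(tokens, num_hashes=bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(resource)
    return [group for group in buckets.values() if len(group) > 1]

def jaccard(a, b):
    """Textual similarity used within a candidate group."""
    return len(a & b) / len(a | b)
```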

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>,54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes—e.g., to notify another agent or correct broken links—and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs—particularly substitution—does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

56 Again, our inspection results are available at http://swse.deri.org/entity

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence-class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper—particularly as applied to large-scale Linked Data corpora—is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf#
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec/
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni Monica De Martino Semantic similarity of ontologyinstances tailored on the application context in Proc of OTM 2006 Part IVolume 4275 of LNCS Springer 2006 pp 1020ndash1038

[2] Riccardo Albertoni Monica De Martino Asymmetric and context-dependentsemantic similarity among ontology instances J Data Sem 4900 (10) (2008)1ndash30

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design Issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A Bernstein Sergey Melnik Peter Mork Interactive schema translationwith instance-level mappings in Proc of VLDB 2005 ACM Press 2005 pp1283ndash1286

[6] Barry Bishop Atanas Kiryakov Damyan Ognyanov Ivan Peikov ZdravkoTashev Ruslan Velkov Factforge a fast track to the web of data Semantic Web2 (2) (2011) 157ndash166

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder Felix Naumann Data fusion ACM Comput Surv 41 (1) (2008)[9] Piero A Bonatti Aidan Hogan Axel Polleres Luigi Sauro Robust and scalable

linked data reasoning incorporating provenance and trust annotations J WebSem 9 (2) (2011) 165ndash201

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OKKAM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).


[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.

[13] Zhaoqi Chen Dmitri V Kalashnikov Sharad Mehrotra Exploiting relationshipsfor object consolidation in IQIS rsquo05 Proceedings of the Second internationalworkshop on Information quality in information systems ACM Press NewYork NY USA 2005 pp 47ndash58

[14] Gong Cheng Weiyi Ge Honghan Wu Yuzhong Qu Searching Semantic Webobjects based on class hierarchies in Proceedings of Linked Data on the WebWorkshop (2008)

[15] Gong Cheng Yuzhong Qu Searching linked objects with falcons approachimplementation and evaluation Int J Sem Web Inf Syst 5 (3) (2009) 49ndash70

[16] Martine De Cock Etienne E Kerre On (un)suitable fuzzy relations to modelapproximate equality Fuzzy Sets Syst 133 (2) (2003) 137ndash153

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, in: WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] AK Elmagarmid PG Ipeirotis VS Verykios Duplicate record detection asurvey IEEE Trans Knowledge Data Eng 19 (1) (2007) 1ndash16

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondřej Šváb-Zamazal, Vojtěch Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, in: OM (2010).

[21] Wenfei Fan Xibei Jia Jianzhong Li Shuai Ma Reasoning about record matchingrules PVLDB 2 (1) (2009) 407ndash418

[22] Hugh Glaser Afraz Jaffri Ian Millard Managing co-reference on the SemanticWeb In Proc of LDOW 2009 (2009)

[23] Hugh Glaser Ian Millard Afraz Jaffri RKBExplorercom A knowledge driveninfrastructure for linked data providers in ESWC Demo Lecture Notes inComputer Science Springer 2008 pp 797ndash801

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, in: ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan Andreas Harth Stefan Decker Performing Object Consolidationon the Semantic Web Data Graph in First I3 Workshop Identity IdentifiersIdentification Workshop 2007

[31] Aidan Hogan Andreas Harth Axel Polleres Scalable Authoritative OWLReasoning for the Web Int J Semantic Web Inf Syst 5 (2) (2009) 49ndash90

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan Jeff Z Pan Axel Polleres Stefan Decker SAOR template ruleoptimisations for distributed reasoning over 1 billion linked data triples inInternational Semantic Web Conference 2010

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu Jianfeng Chen Yuzhong Qu A self-training approach for resolvingobject coreference on the Semantic Web WWW (2011) 87ndash96

[36] Wei Hu Yuzhong Qu Xingzhi Sun Bootstrapping object coreferencing on theSemantic Web J Comput Sci Technol 26 (4) (2011) 663ndash675

[37] Ekaterini Ioannou Odysseas Papapetrou Dimitrios Skoutas Wolfgang NejdlEfficient semantic-aware detection of near duplicate resources in ESWC (2)2010 pp 136ndash150

[38] Anja Jentzsch Jun Zhao Oktie Hassanzadeh Kei-Hoi Cheung MatthiasSamwald Bo Andersson Linking Open Drug Data in InternationalConference on Semantic Systems (I-SEMANTICSrsquo09) 2009

[39] Frank Klawonn Should fuzzy equality and similarity satisfytransitivityComments on the paper by M De Cock and E Kerre Fuzzy SetsSyst 133 (2) (2003) 175ndash180

[40] Vladimir Kolovski Zhe Wu George Eadon Optimizing enterprise-scale OWL 2RL reasoning in a relational database system in International Semantic WebConference (2010)

[41] Juanzi Li Jie Tang Yi Li Qiong Luo Rimom A dynamic multistrategy ontologyalignment framework IEEE Trans Knowl Data Eng 21 (8) (2009) 1218ndash1232

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos Nick Bassiliades DLEJena A practical forward-chainingOWL 2 RL reasoner combining Jena and Pellet J Web Sem 8 (1) (2010) 89ndash94

[44] Martin Michalowski Snehal Thakkar Craig A Knoblock Exploiting secondarysources for automatic object consolidation in Proceeding of 2003 KDDWorkshop on Data Cleaning Record Linkage and Object Consolidation (2003)

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov Victoria S Uren Enrico Motta Anne N De Roeck Integration ofSemantically Annotated Data by the KnoFuss Architecture in EKAW 2008 pp265ndash274

[49] Jan Noessner Mathias Niepert Christian Meilicke Heiner StuckenschmidtLeveraging terminological structure for object reconciliation in ESWC (2)2010 pp 334ndash348

[50] Eyal Oren Renaud Delbru Michele Catasta Richard Cyganiak HolgerStenzhorn Giovanni Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J Metadata Semant Ontol 3 (1) (2008) 37ndash52

[51] Eyal Oren Spyros Kotoulas George Anadiotis Ronny Siebes Annette ten TeijeFrank van Harmelen Marvin distributed reasoning over large-scale SemanticWeb data J Web Sem 7 (4) (2009) 305ndash316

[52] Niko Popitsch Bernhard Haslhofer DSNotify handling broken links in the webof data in WWW (2010) 761ndash770

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/

[54] Yves Raimond Christopher Sutton Mark B Sandler Interlinking Music-Related Data on the Web IEEE MultiMedia 16 (2) (2009) 52ndash63

[55] Raymond Reiter A Theory of Diagnosis from First Principles Artif Intell 32 (1)(1987) 57ndash95

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores Gianluca Correndo Bene Rodriguez-Castro NicholasGibbins John Darlington Nigel R Shadbolt LinksB2N automatic dataintegration for the Semantic Web in OTM Conferences (2) 2009 pp 1121ndash1138

[58] F Scharffe Y Liu C Zhou RDF-AI an architecture for RDF datasets matchingfusion and interlink in IJCAI 2009 Workshop on Identity Reference andKnowledge Representation (IR-KR) (2009)

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer? in: PICKME Workshop (2008).

[60] Sameer Singh Michael L Wick Andrew McCallum Distantly labeling data forlarge scale cross-document coreference CoRR (2010) abs10054298

[61] Jennifer Sleeman Tim Finin Learning co-reference relations for FOAFinstances in Poster and Demo Session at ISWC (2010)

[62] Dezhao Song Jeff Heflin Automatically generating data linkages using adomain-independent candidate selection approach in International SemanticWeb Conference (2011)

[63] Mechthild Stoer Frank Wagner A simple min-cut algorithm J ACM 44 (4)(1997) 585ndash591

[64] Michael Stonebraker The Case for Shared Nothing IEEE Database Eng Bull 9(1) (1986) 4ndash9

[65] Heiner Stuckenschmidt A semantic similarity measure for ontology-basedinformation in FQAS rsquo09 Proceedings of the Eighth International Conferenceon Flexible Query Answering Systems Springer-Verlag Berlin Heidelberg2009 pp 406ndash417

[66] Robert Endre Tarjan A class of algorithms which require nonlinear time tomaintain disjoint sets J Comput Syst Sci 18 (2) (1979) 110ndash127

[67] Herman J ter Horst Completeness decidability and complexity of entailmentfor RDF Schema and a semantic extension involving the OWL vocabulary JWeb Sem 3 (2005) 79ndash115

[68] Andreas Thor Erhard Rahm Moma ndash a mapping-based object matchingsystem in CIDR (2007) 247ndash258

[69] Giovanni Tummarello Richard Cyganiak Michele Catasta Szymon DanielczykStefan Decker Sigma live views on the Web of data in Semantic WebChallenge (ISWC2009) (2009)

[70] Jacopo Urbani Spyros Kotoulas Jason Maassen Frank van Harmelen Henri EBal OWL reasoning with WebPIE calculating the closure of 100 billion triplesin ESWC (1) 2010 pp 213ndash227

[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649


[72] Julius Volz Christian Bizer Martin Gaedke Georgi Kobilarov Discovering andmaintaining links on the Web of data in International Semantic WebConference (2009) 650ndash665

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver James A Hendler Parallel materialization of the finite RDFSclosure for hundreds of millions of triples in International Semantic WebConference (ISWC2009) (2009) 682ndash697

[75] Lotfi A Zadeh George J Klir Bo Yuan Fuzzy Sets Fuzzy Logic Fuzzy SystemsWorld Scientific Press 1996


Table 10Largest five equivalence classes after extended consolidation

Canonical term (lexically lowest in equivalence class) Size Correct

1 bnode37httpa12iggymomvoxcomprofilefoafrdf 33052 U

2 httpbio2rdforgdailymed_drugs1000 84813 httpajftorgrdffoafrdf_me 8140 U

4 bnode4http174129121408080tcmdataassociation100 4395 U

5 bnode1httpaaa977voxcomprofilefoafrdf 1977 U

1

10

100

1000

10000

100000

1e+006

1e+007

1 10 100

num

ber o

f cla

sses

number of PLDs in equivalence class

baseline consolidationreasoning consolidation

Fig 6 Distribution of the number of PLDs per equivalence class for baselineconsolidation and extended reasoning consolidation (loglog)

Table 11Number of equivalent identifiers found for the top-five ranked entities in SWSE withrespect to baseline consolidation (BL) and reasoning consolidation (R)

Canonical Identifier BL R

1 lthttpdblpl3sdeTim_Berners-Leegt 26 502 ltgeniddanbrigt 10 1383 lthttpupdatestatusnetgt 0 04 lthttpwwwldoddscomfoaffoaf-a-maticgt 0 05 lthttpupdatestatusnetuser1acctgt 0 6

Fig. 7. Distribution of the number of PLDs the terms are referenced by, for the raw, baseline consolidated and reasoning consolidated data (log/log).


to a high number of blank-nodes being consolidated correctly from very uniform exporters of FOAF data that "dilute" the incorrect results.25

Fig. 6 presents a similar analysis to Fig. 5, this time looking at identifiers on a PLD-level granularity. Interestingly, the difference between the two approaches is not so pronounced, initially indicating that many of the additional equivalences found through the consolidation rules are "intra-PLD". In the baseline consolidation approach, we determined that 57% of equivalence classes were inter-PLD (contain identifiers from more than one PLD), with the plurality of equivalence classes containing identifiers from precisely two PLDs (951 thousand; 44.1%); this indicates that explicit owl:sameAs relations are commonly asserted between PLDs. In the extended consolidation approach (which of course subsumes the above results), we determined that the percentage of inter-PLD equivalence classes dropped to 43.6%, with the majority of equivalence classes containing identifiers from only one PLD (1.59 million; 56.4%). The entity with the most diverse identifiers

25 We make these and later results available for review at http://swse.deri.org/entity.

(the observable outlier on the x-axis in Fig. 6) was the person "Dan Brickley", one of the founders and leading contributors of the FOAF project, with 138 identifiers (67 URIs and 71 blank-nodes) minted in 47 PLDs; various other prominent community members and some country identifiers also featured high on the list.

In Table 11, we compare the consolidation of the top five ranked identifiers in the SWSE system (see [32]). The results refer respectively to: (i) the (co-)founder of the Web, "Tim Berners-Lee"; (ii) "Dan Brickley", as aforementioned; (iii) a meta-user for the micro-blogging platform StatusNet, which exports RDF; (iv) the "FOAF-a-matic" FOAF profile generator (linked from many diverse domains hosting FOAF profiles it created); and (v) "Evan Prodromou", founder of the identi.ca/StatusNet micro-blogging service and platform. We see a significant increase in equivalent identifiers found for these results; however, we also noted that after reasoning consolidation, Dan Brickley was conflated with a second person.26

Note that the most frequently co-occurring PLDs in our equivalence classes remained unchanged from Table 7.

During the rewrite of the main corpus, terms in 151.77 million subject positions (13.58% of all subjects) and 32.16 million object positions (3.53% of non-rdf:type objects) were rewritten, giving a total of 183.93 million positions rewritten (1.8× the baseline consolidation approach). In Fig. 7, we compare the re-use of terms across PLDs before consolidation, after baseline consolidation, and after the extended reasoning consolidation. Again, although there is an increase in re-use of identifiers across PLDs, we note that

26 Domenico Gendarmi, with three URIs; one document assigns one of Dan's foaf:mbox_sha1sum values (for danbri@w3.org) to Domenico: http://foafbuilder.qdos.com/people/myriamleggieri.wordpress.com/foaf.rdf.


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity, in terms of unique PLDs, was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score; for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities, with higher concurrence scores, are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34] where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results: we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain-agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities; if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared pairs: pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities; a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs: e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers); we wish to dampen the cumulative effect of such correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .

ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .

ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .

ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between three colleagues ex:Alice, ex:Bob and ex:Claire: i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p wrt. s in G, denoted $Card_G(p,s)$, is the cardinality of the set $\{o \in \mathbf{C} \mid (s,p,o) \in G\}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p wrt. o in G, denoted $ICard_G(p,o)$, is the cardinality of the set $\{s \in \mathbf{U} \cup \mathbf{B} \mid (s,p,o) \in G\}$.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting $G_{ex}$ denote our example graph, then, e.g., $Card_{G_{ex}}$(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph; e.g., $ICard_{G_{ex}}$(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
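To make Definitions 1 and 2 concrete, the following is a minimal sketch (not the authors' implementation) that computes both statistics over the running example; the encoding of triples as Python tuples is purely illustrative.

# Minimal sketch of Definitions 1 and 2 over the running example G_ex;
# triples are encoded as (subject, predicate, object) tuples for illustration.
G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice",  "foaf:gender", '"female"'),
    ("ex:Alice",  "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Bob",    "foaf:gender", '"male"'),
    ("ex:Bob",    "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
]

def card(graph, p, s):
    """Observed cardinality: number of distinct objects for the pair (p, s)."""
    return len({o for (s2, p2, o) in graph if s2 == s and p2 == p})

def icard(graph, p, o):
    """Observed inverse-cardinality: number of distinct subjects for the pair (p, o)."""
    return len({s for (s2, p2, o2) in graph if o2 == o and p2 == p})

print(card(G_ex, "foaf:maker", "dblp:AliceB10"))   # 2
print(icard(G_ex, "foaf:gender", '"female"'))      # 2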

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written $AIC_G(p)$, is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$AIC_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality $AC_G(p)$ of a property p is defined analogously as for $AIC_G(p)$. Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is by definition greater-than or equal-to the denominator. Taking an example, $AIC_{G_{ex}}$(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe $AIC_G(p) \approx 1$ a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe $AC_G(p) \approx 1$ as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high $AIC_G(p)$, since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower $AIC_G(p)$, since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values for properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula as follows to weight properties with few observations towards a global mean.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written $AAIC_G(p)$, is then

$$AAIC_G(p) = \frac{AIC_G(p) \cdot |G_p| + \overline{AIC_G} \cdot \overline{|G|}}{|G_p| + \overline{|G|}} \qquad (1)$$

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{AIC_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{AIC_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$).
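As a concrete illustration of Definitions 3 and 4, the following minimal sketch (not the authors' implementation) derives the per-predicate AIC readings from a list of triples and then applies the credibility weighting of Eq. (1); the helper names are ours.

from collections import defaultdict

def aic_per_predicate(graph):
    """AIC_G(p) = (# triples with predicate p) / (# distinct objects of p)."""
    triples, objects = defaultdict(int), defaultdict(set)
    for s, p, o in graph:
        triples[p] += 1
        objects[p].add(o)
    aic = {p: triples[p] / len(objects[p]) for p in triples}
    n_obj = {p: len(objects[p]) for p in triples}   # |G_p| per predicate
    return aic, n_obj

def aaic_per_predicate(graph):
    """Eq. (1): dampen each predicate's AIC reading towards the global mean."""
    aic, n_obj = aic_per_predicate(graph)
    total_triples = len(graph)
    total_pairs = sum(n_obj.values())           # |{(p,o) | there is s : (s,p,o) in G}|
    mean_aic = total_triples / total_pairs      # mean reading over all (p,o) pairs
    mean_size = total_pairs / len(n_obj)        # average sample size per predicate
    return {p: (aic[p] * n_obj[p] + mean_aic * mean_size) / (n_obj[p] + mean_size)
            for p in aic}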

Again, the adjusted average cardinality $AAC_G(p)$ of a property p is defined analogously as for $AAIC_G(p)$. Some reduction of Eq. (1) is possible if one considers that $AIC_G(p) \cdot |G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{AIC_G} \cdot \overline{|G|} = \frac{|G|}{|\{p \mid \exists s \exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate.


Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).


We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading ($AIC_G(p)$) is dampened towards a mean ($\overline{AIC_G}$) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($\overline{|G|}$).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company, e.g., http://google.com/, should not gain as much concurrence as entities that share a value referential to a smaller company, e.g., http://deri.ie/. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occurs in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as:

$$C_G(p,s) = \frac{1}{Card_G(p,s) \cdot AAC_G(p)}$$

and the concurrence-coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as:

$$IC_G(p,o) = \frac{1}{ICard_G(p,o) \cdot AAIC_G(p)}$$

Again, please note that these coefficients fall into the interval ]0,1], since the denominator is by definition necessarily greater than or equal to one.

To take an example, let $p_{wh}$ = foaf:workplaceHomepage and say that we compute $AAIC_G(p_{wh}) = 7$ from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let $o_g$ = http://google.com/ and assume that $o_g$ occurs 2,000 times as a value for $p_{wh}$: $ICard_G(p_{wh}, o_g) = 2000$; now $IC_G(p_{wh}, o_g) = \frac{1}{2000 \cdot 7} \approx 0.00007$. Also, let $o_d$ = http://deri.ie/ such that $ICard_G(p_{wh}, o_d) = 100$; now $IC_G(p_{wh}, o_d) = \frac{1}{100 \cdot 7} \approx 0.00143$. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what's signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let $Z = (z_1, \ldots, z_n)$ be a tuple such that for each $i = 1, \ldots, n$, $z_i \in\ ]0,1]$. The aggregated concurrence value $ACS_n$ is computed iteratively: starting with $ACS_0 = 0$, then for each $k = 1, \ldots, n$, $ACS_k = z_k + ACS_{k-1} - z_k \cdot ACS_{k-1}$.

The computation of the ACS value is the same process as determining the probability of two independent events occurring, $P(A \vee B) = P(A) + P(B) - P(A) \cdot P(B)$, which is by definition commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
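The iterative computation of Definition 6 amounts to a probabilistic sum of the coefficients; the following is a minimal sketch with an illustrative usage example:

def acs(coefficients):
    """Aggregated concurrence score: probabilistic sum of coefficients in ]0,1]."""
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

# e.g., two moderately exclusive shared pairs, plus one very weak one:
print(acs([0.5, 0.5]))            # 0.75
print(acs([0.5, 0.5, 0.00007]))   # ~0.75002: weak evidence adds little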

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges, i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\bar{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate $p_i$.29 We then aggregate this bag as follows:

$$ACS(\bar{Z}) = ACS\left(\left\langle \frac{ACS\big(Z_{p_i} \cdot AA[I]C(p_i)\big)}{AA[I]C(p_i)} \right\rangle_{Z_{p_i} \in \bar{Z}}\right)$$

where AA[I]C is either AAC or AAIC, depending on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.
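The following sketch shows one possible reading of this two-level aggregation; the per-predicate normalisation by AA[I]C reflects our reconstruction of the formula above and should not be read as a verified transcription. The acs() helper is the probabilistic sum from the previous sketch, repeated here for self-containment.

def acs(coefficients):
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

def acs_grouped(shared_pairs, aaic):
    """shared_pairs: predicate -> list of raw coefficients, each 1/(|class| * AA[I]C(p));
    aaic: predicate -> AA[I]C(p). Groups coefficients by predicate before aggregating."""
    contributions = []
    for p, coeffs in shared_pairs.items():
        inner = acs([z * aaic[p] for z in coeffs])   # diminishing returns per predicate
        contributions.append(inner / aaic[p])        # bounded by 1 / AA[I]C(p)
    return acs(contributions)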

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s,p,o,c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o,p,s,c) order, counting the number of distinct subjects for each (p,o) pair and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s,p) pair, for each set of unique objects $O_{p,s}$ found thereon, and for each pair in

$$\{(o_i, o_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid o_i, o_j \in O_{p,s},\ o_i < o_j\}$$

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

$$(o_i, o_j, C(p,s), p, s, -)$$

where $C(p,s) = \frac{1}{|O_{p,s}| \cdot AAC(p)}$. We apply the same process for the other direction: for each (p,o) pair, for each set of unique subjects $S_{p,o}$, and for each pair in

$$\{(s_i, s_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid s_i, s_j \in S_{p,o},\ s_i < s_j\}$$

we output analogous sextuples of the form:

$$(s_i, s_j, IC(p,o), p, o, +)$$

We call the sets $O_{p,s}$ and their analogues $S_{p,o}$ concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end, so as to forego duplication of intermediary processing.

29 For brevity, we omit the graph subscript.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements: generalising the directional $o_i$/$s_i$ to simply $e_i$, each $(e_i, e_j)$ pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each $(e_i, e_j)$ pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form $(e_i, e_j, s)$, where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples $(e_j, e_i, s)$, which can be sorted separately as required.
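To illustrate the two passes just described, the following is a simplified, single-machine sketch (not the distributed implementation) of emitting raw concurrence tuples from data sorted by object, and of aggregating the final scores once those tuples have been sorted by entity pair; function and variable names are illustrative, and the aggregation here ignores the per-predicate grouping of Section 6.1.1.

from itertools import combinations, groupby

def acs(coefficients):
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

def emit_po_tuples(triples_sorted_by_ops, aaic, max_class_size=38):
    """Input: (s, p, o) triples sorted by (o, p, s); yields (s_i, s_j, IC(p,o), p, o, '+')."""
    for (o, p), group in groupby(triples_sorted_by_ops, key=lambda t: (t[2], t[1])):
        subjects = sorted({s for (s, _, _) in group})
        n = len(subjects)
        if n < 2 or n > max_class_size:
            continue                         # skip trivial and over-large concurrence classes
        coeff = 1.0 / (n * aaic[p])          # IC(p, o) = 1 / (|S_{p,o}| * AAIC(p))
        for si, sj in combinations(subjects, 2):
            yield (si, sj, coeff, p, o, "+")

def aggregate(concurrence_tuples_sorted_by_pair):
    """After sorting the raw tuples by (e_i, e_j), aggregate the coefficients per pair."""
    for (ei, ej), group in groupby(concurrence_tuples_sorted_by_pair,
                                   key=lambda t: (t[0], t[1])):
        yield (ei, ej, acs([t[2] for t in group]))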

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets; for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate $\frac{10^{12} - 10^6}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set, e.g., $|S_{p,o}|$, is a factor in the denominator of the computed $IC(p,o)$ score, such that $IC(p,o) \leq \frac{1}{|S_{p,o}|}$. Thus, in practice, we implement a maximum-size threshold for the $S_{p,o}$ and $O_{p,s}$ concurrence classes: this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
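A minimal sketch of how such a threshold can be chosen from the class-size distribution gathered during the statistics scan, assuming a budget of one billion raw tuples (the figure used in our evaluation); the function name and dictionary encoding are illustrative:

def choose_max_class_size(class_size_counts, budget=10**9):
    """class_size_counts: dict mapping class size s -> number of classes of that size.
    Returns the largest cut-off such that the cumulative tuple count stays within budget."""
    total, cutoff = 0, 1
    for s in sorted(class_size_counts):
        pairs = class_size_counts[s] * s * (s - 1) // 2   # non-reflexive, non-symmetric pairs
        if total + pairs > budget:
            break
        total += pairs
        cutoff = s
    return cutoff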

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash-function; the slaves simultaneously gather incoming sorted segments, and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object; triples with rdf:type as predicate are not included in the object-hashing; subsequently the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12. Breakdown of timing of distributed concurrence analysis, for 1, 2, 4 and 8 slave machines over 100 million quadruples, and for the full corpus over 8 slave machines (Full-8); times are in minutes, with the percentage of total execution time in parentheses.

Category                               | 1             | 2             | 4             | 8            | Full-8
Master
  Total execution time                 | 516.2 (100.0) | 262.7 (100.0) | 137.8 (100.0) | 73.0 (100.0) | 835.4 (100.0)
  Executing                            | 2.2 (0.4)     | 2.3 (0.9)     | 3.1 (2.3)     | 3.4 (4.6)    | 2.1 (0.3)
    Miscellaneous                      | 2.2 (0.4)     | 2.3 (0.9)     | 3.1 (2.3)     | 3.4 (4.6)    | 2.1 (0.3)
  Idle (waiting for slaves)            | 514.0 (99.6)  | 260.4 (99.1)  | 134.6 (97.7)  | 69.6 (95.4)  | 833.3 (99.7)
Slave
  Avg. executing (total exc. idle)     | 514.0 (99.6)  | 257.8 (98.1)  | 131.1 (95.1)  | 66.4 (91.0)  | 786.6 (94.2)
    Split/sort/scatter (subject)       | 119.6 (23.2)  | 59.3 (22.6)   | 29.9 (21.7)   | 14.9 (20.4)  | 175.9 (21.1)
    Merge-sort (subject)               | 42.1 (8.2)    | 20.7 (7.9)    | 11.2 (8.1)    | 5.7 (7.9)    | 62.4 (7.5)
    Split/sort/scatter (object)        | 115.4 (22.4)  | 56.5 (21.5)   | 28.8 (20.9)   | 14.3 (19.6)  | 164.0 (19.6)
    Merge-sort (object)                | 37.2 (7.2)    | 18.7 (7.1)    | 9.6 (7.0)     | 5.1 (7.0)    | 55.6 (6.6)
    Extract high-level statistics      | 20.8 (4.0)    | 10.1 (3.8)    | 5.1 (3.7)     | 2.7 (3.7)    | 28.4 (3.3)
    Generate raw concurrence tuples    | 45.4 (8.8)    | 23.3 (8.9)    | 11.6 (8.4)    | 5.9 (8.0)    | 67.7 (8.1)
    Coordinate/sort concurrence tuples | 97.6 (18.9)   | 50.1 (19.1)   | 25.0 (18.1)   | 12.5 (17.2)  | 166.0 (19.9)
    Merge-sort/aggregate similarity    | 36.0 (7.0)    | 19.2 (7.3)    | 10.0 (7.2)    | 5.3 (7.2)    | 66.6 (8.0)
  Avg. idle                            | 2.2 (0.4)     | 4.9 (1.9)     | 6.7 (4.9)     | 6.5 (9.0)    | 48.8 (5.8)
    Waiting for peers                  | 0.0 (0.0)     | 2.6 (1.0)     | 3.6 (2.6)     | 3.2 (4.3)    | 46.7 (5.6)
    Waiting for master                 | 2.2 (0.4)     | 2.3 (0.9)     | 3.1 (2.3)     | 3.4 (4.6)    | 2.1 (0.3)


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step; in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.
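A minimal single-process sketch of this split/coordinate pattern (the real implementation scatters the splits over the network to the peer slave machines; the hash function and machine count here are purely illustrative):

import heapq

def split_by_hash(quads, num_slaves, position=0):
    """Split locally sorted quads by a modulo-hash on the given join position
    (0 = subject, 2 = object); split i is destined for slave machine i."""
    splits = [[] for _ in range(num_slaves)]
    for quad in quads:
        splits[hash(quad[position]) % num_slaves].append(quad)
    return splits

def merge_sorted_segments(segments):
    """Merge-sort the sorted segments received from the peers."""
    return list(heapq.merge(*segments))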

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to uneven distribution of data and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 min is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1) to 9% (8). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power–law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static, i.e., larger class sizes give


Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs (a) and predicate–subject pairs (b) respectively, where for each class size on the x-axis ($s_i$) we show the number of classes ($c_i$), the raw non-reflexive, non-symmetric concurrence tuples generated ($sp_i = c_i \cdot \frac{s_i^2 - s_i}{2}$), a cumulative count of tuples generated for increasing class sizes ($csp_i = \sum_{j \leq i} sp_j$), and our cut-off ($s_i = 38$) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13. Top five predicates with respect to lowest adjusted average cardinality (AAC).

  #  Predicate                  Objects      Triples      AAC
  1  foaf:nick                  150,433,035  150,437,864  1.000
  2  lldpubmed:journal            6,790,426    6,795,285  1.003
  3  rdf:value                    2,802,998    2,803,021  1.005
  4  eurostat:geo                 2,642,034    2,642,034  1.005
  5  eurostat:time                2,641,532    2,641,532  1.005

Table 14. Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  #  Predicate                    Subjects   Triples    AAIC
  1  lldpubmed:meshHeading        2,121,384  2,121,384  1.009
  2  opiumfield:recommendation    1,920,992  1,920,992  1.010
  3  fb:type.object.key           1,108,187  1,108,187  1.017
  4  foaf:page                    1,702,704  1,712,970  1.017
  5  skipinions:hasFeature        1,010,604  1,010,604  1.019


quadratically more concurrences, but occur polynomially less often; until the point where the largest classes, which generally only have one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power–law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000), we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996); the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs and 303 million through (p,s) pairs).

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although

Table 15. Top five concurrent entities and the number of pairs they share.

  #  Entity label 1  Entity label 2  Concur.
  1  New York City   New York State  791
  2  London          England         894
  3  Tokyo           Japan           900
  4  Toronto         Ontario         418
  5  Philadelphia    Pennsylvania    217

Table 16. Breakdown of concurrences for top five ranked entities in SWSE, ordered by rank, with, respectively: entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

  #  Ranked entity     Con.   "Closest" entity   Val.
  1  Tim Berners-Lee    908   Lalana Kagal       0.83
  2  Dan Brickley      2552   Libby Miller       0.94
  3  update.status.net   11   socialnetwork.ro   0.45
  4  FOAF-a-matic        21   foaf.me            0.23
  5  Evan Prodromou    3367   Stav Prodromou     0.89


fewer inter-PLD concurrences are found, they typically have higher concurrences.32

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters; properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities; for brevity, again we show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s,p,o,c,s′), which are then sorted and consolidated by o to produce the given sextuples.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .
?x ?p ?l1 , ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where we underline the terminological pattern (here, the first premise). To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for ?l1 and ?l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
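The relaxed comparison described above can be sketched as follows (our reading, not the authors' code); literals are modelled as (lexical form, language tag, datatype) triples purely for illustration:

def clashes(lit1, lit2):
    """Flag a clash only if the values differ and their lower-cased lexical
    forms (ignoring language tag and datatype) also differ."""
    (lex1, lang1, dt1), (lex2, lang2, dt2) = lit1, lit2
    if (lex1, lang1, dt1) == (lex2, lang2, dt2):
        return False                          # identical literals: no clash
    return lex1.lower() != lex2.lower()       # relaxed lexical comparison

# e.g., none of the ex:Ted values above would be flagged:
print(clashes(("male", None, None), ("Male", "en", None)))                     # False
print(clashes(("25-05", None, "xsd:gMonthDay"), ("25-05", None, None)))        # False
# whereas the two dpo:length values of the Fiat example would be:
print(clashes(("3.546", None, "xsd:double"), ("2.97", None, "xsd:double")))    # True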

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected, e.g.:

Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:

made owl:propertyDisjointWith maker

Assertional [fictional]:

ex:AZ00 maker ex:entcons0

dblp:Antoine_Zimmermann00 made dblp:HoganZUPD150

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics. c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq⁰, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eq^n such that ∄{a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej has already been merged in this iteration), producing Eq^(n+1) at the iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
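As a concrete illustration of this clustering, the following in-memory Python sketch of ours (a toy, not the actual implementation) takes the identifiers of one equivalence class, the incompatible pairs C, and the (d, c) edge weights, and returns the repaired partitions; summing concurrence scores over crossing edges, and requiring a positive weight before merging, are our own simplifying assumptions.

  # Greedy consistent clustering of one equivalence class (illustrative sketch).
  # ids:    identifiers in the equivalence class
  # edges:  dict mapping frozenset({a, b}) -> (d, c) weight between two identifiers
  # incomp: set of frozenset({a, b}) incompatible pairs that must stay apart

  def aggregated_weight(Ei, Ej, edges):
      # (d, c) weight as if the identifiers in Ei and Ej were merged: sum the
      # independent evidences d, and (as a simple stand-in) the concurrence scores c,
      # over all edges crossing the two sets
      d, c = 0, 0.0
      for a in Ei:
          for b in Ej:
              if frozenset((a, b)) in edges:
                  dd, cc = edges[frozenset((a, b))]
                  d, c = d + dd, c + cc
      return (d, c)   # compared lexicographically: evidences first, ties by concurrence

  def repair(ids, edges, incomp):
      parts = [frozenset({i}) for i in ids]            # singleton seed partitions
      while True:
          candidates = []
          for i in range(len(parts)):
              for j in range(i + 1, len(parts)):
                  Ei, Ej = parts[i], parts[j]
                  # consistently mergeable: no incompatible pair across Ei and Ej
                  if any(frozenset((a, b)) in incomp for a in Ei for b in Ej):
                      continue
                  w = aggregated_weight(Ei, Ej, edges)
                  if w > (0, 0.0):                     # only merge linked partitions
                      candidates.append((w, Ei, Ej))
          if not candidates:
              return parts                             # fixpoint reached
          merged_this_round, merged = set(), []
          for w, Ei, Ej in sorted(candidates, key=lambda t: t[0], reverse=True):
              if Ei in merged_this_round or Ej in merged_this_round:
                  continue                             # each set merged at most once per round
              merged_this_round |= {Ei, Ej}
              merged.append(Ei | Ej)
          parts = merged + [p for p in parts if p not in merged_this_round]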

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks, and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s0, o0) and object (o, p, s, c, o0, s0), again such that s, o represent canonical identifiers and s0, o0 represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s-o over the sorted stream of data and batch consolidated segments of data.
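For illustration, a minimal Python sketch of ours of such a merge-join (assuming two iterators that yield sextuples already sorted by canonical subject and by canonical object respectively; the actual implementation operates over flat sorted files on disk):

  from itertools import groupby

  def merge_join_batches(subject_sorted, object_sorted):
      # Yield (canonical_id, outlinks, inlinks) batches; each batch holds all
      # subject-sorted and object-sorted sextuples of one consolidated entity.
      out_iter = groupby(subject_sorted, key=lambda t: t[0])   # group by canonical s
      in_iter = groupby(object_sorted, key=lambda t: t[0])     # group by canonical o
      out_key, out_grp = next(out_iter, (None, iter(())))
      in_key, in_grp = next(in_iter, (None, iter(())))
      while out_key is not None or in_key is not None:
          keys = [k for k in (out_key, in_key) if k is not None]
          k = min(keys)                                        # smaller current key first
          outlinks = list(out_grp) if out_key == k else []
          inlinks = list(in_grp) if in_key == k else []
          yield k, outlinks, inlinks
          if out_key == k:
              out_key, out_grp = next(out_iter, (None, iter(())))
          if in_key == k:
              in_key, in_grp = next(in_iter, (None, iter(())))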

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
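The edge-labelling scheme can be pictured as follows (a Python sketch of ours; hashing and names are illustrative): each equivalence edge carries the set of hashes of the input triples that justify it, the set's cardinality acts as the primary weight, and merging two nodes unions the hash-sets of edges that come to overlap.

  from collections import defaultdict

  evidence = defaultdict(set)     # frozenset({a, b}) -> hashes of justifying input triples

  def add_evidence(a, b, triples):
      # e.g. an asserted owl:sameAs triple, or the pair of triples sharing an
      # inverse-functional object / functional subject
      evidence[frozenset((a, b))].update(hash(t) for t in triples)

  def primary_weight(pair):
      return len(evidence[pair])  # number of independent evidences for the edge

  def merge_nodes(ev, u, v):
      # merge identifier v into u: v-w edges become u-w edges;
      # hash-sets of edges that now overlap are unioned
      merged = defaultdict(set)
      for pair, hashes in ev.items():
          a, b = tuple(pair)
          a, b = (u if a == v else a), (u if b == v else b)
          if a != b:              # drop the now-internal u-v edge
              merged[frozenset((a, b))] |= hashes
      return merged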

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

Category                                    1             2             4             8             Full-8
                                            Min    %      Min    %      Min    %      Min    %      Min    %
Master
  Total execution time                      1147   100.0  613    100.0  367    100.0  238    100.0  1411   100.0
  Idle (waiting for slaves)                 1147   100.0  612    100.0  366    99.9   238    99.9   1411   100.0
Slave
  Avg. executing (total exc. idle)          1147   100.0  590    96.3   336    91.5   223    93.7   1309   92.8
    Identify inconsistencies and repairs    747    65.1   385    62.8   232    63.4   172    72.2   724    51.3
    Repair corpus                           400    34.8   205    33.5   103    28.1   51     21.5   585    41.5
  Avg. idle                                 0.0    0.0    23     3.7    31     8.5    15     6.3    102    7.2
    Waiting for peers                       0.0    0.0    22     3.7    31     8.4    15     6.2    101    7.2

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s0, o0) and (o, p, s, c, o0, s0) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 23.5 h. The inconsistency extraction and equivalence class repair analysis took 12 h, with a significant average idle time of 75 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 9.8 h, with an average idle time of 27 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards to the full corpus, note that the aggregation of the repair information took a negligible amount of time, where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage (and the associated load-balancing issues) would likely be aggravated further by more machines or a higher scale of data.

With regards to the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
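These factors are simply the ratios of successive total execution times in Table 17 (our arithmetic, for reference):

  \frac{613}{1147} \approx 0.534, \qquad \frac{367}{613} \approx 0.599, \qquad \frac{238}{367} \approx 0.649.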

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual       Auto         Count    %
SAME         /            223      44.3
DIFFERENT    /            261      51.9
UNCLEAR      –            19       3.8
/            SAME         361      74.6
/            DIFFERENT    123      25.4
SAME         SAME         205      91.9
SAME         DIFFERENT    18       8.1
DIFFERENT    DIFFERENT    105      40.2
DIFFERENT    SAME         156      59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

    Functional property                  Detections
1   foaf:gender                          60
2   foaf:age                             32
3   dbo:height, dbo:length               4
4   dbo:wheelbase, dbo:width             3
5   atomowl:body                         1
6   loc:address, loc:name, loc:phone     1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

    Disjoint class 1       Disjoint class 2    Detections
1   foaf:Document          foaf:Person         163
2   foaf:Document          foaf:Agent          153
3   foaf:Organization      foaf:Person         17
4   foaf:Person            foaf:Project        3
5   foaf:Organization      foaf:Project        1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.

44 We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through

owl:differentFrom assertions. We list the top five functional-properties given non-distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalent classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%)

had to be completely disbanded since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair, and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
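These figures follow from the counts in Table 20 (our derivation, treating pairs kept equivalent by the repair as "positive" for SAME, and pairs separated as "positive" for DIFFERENT):

  P_{same} = \frac{205}{361} \approx 0.568, \quad R_{same} = \frac{205}{223} \approx 0.919, \quad F_1 = \frac{2 \cdot 0.568 \cdot 0.919}{0.568 + 0.919} \approx 0.702;

  P_{diff} = \frac{105}{123} \approx 0.854, \quad R_{diff} = \frac{105}{261} \approx 0.402, \quad F_1 \approx 0.546.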

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external-sorts, characterised by the linearithmic time complexity O(n log(n)); external-sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets), we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such consideration, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third party data about an entity is somewhat fallacious without also considering the third party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works (more related to Information Retrieval and Natural Language Processing) focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.
48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers, and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the aligning step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences, and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system (called ObjectCoRef) for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sigma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data, and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67], using a cluster of commodity hardware similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset, nor by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data, and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching, and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We


have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper (particularly as applied to large-scale Linked Data corpora) is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been

Table A1
Prefixes used throughout the paper.

Prefix          URI

"T-Box prefixes"
atomowl         http://bblfish.net/work/atom-owl/2006-06-06/
b2r             http://bio2rdf.org/bio2rdf
b2rr            http://bio2rdf.org/bio2rdf_resource
dbo             http://dbpedia.org/ontology/
dbp             http://dbpedia.org/property/
ecs             http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat        http://ontologycentral.com/2009/01/eurostat/ns#
fb              http://rdf.freebase.com/ns/
foaf            http://xmlns.com/foaf/0.1/
geonames        http://www.geonames.org/ontology#
kwa             http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed       http://linkedlifedata.com/resource/pubmed/
loc             http://sw.deri.org/2006/07/location/loc#
mo              http://purl.org/ontology/mo/
mvcb            http://webns.net/mvcb/
opiumfield      http://rdf.opiumfield.com/lastfm/spec#
owl             http://www.w3.org/2002/07/owl#
quaffing        http://purl.org/net/schemas/quaffing/
rdf             http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs            http://www.w3.org/2000/01/rdf-schema#
skipinions      http://skipforward.net/skipforward/page/seeder/skipinions/
skos            http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl         http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia         http://dbpedia.org/resource/
dblpperson      http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p       http://www.eswc2006.org/people/
identicau       http://identi.ca/user/
kingdoms        http://lod.geospecies.org/kingdoms/
macs            http://stitch.cs.vu.nl/alignments/macs/
semweborg       http://www.data.semanticweb.org/organization/
swid            http://semanticweb.org/id/
timblfoaf       http://www.w3.org/People/Berners-Lee/card#
vperson         http://virtuoso.openlinksw.com/dataspace/person/
wikier          http://www.wikier.org/foaf.rdf#
yagor           http://www.mpii.de/yago/resource/

funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, Factforge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in:



Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).
[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superceded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan A class of algorithms which require nonlinear time tomaintain disjoint sets J Comput Syst Sci 18 (2) (1979) 110ndash127

[67] Herman J ter Horst Completeness decidability and complexity of entailmentfor RDF Schema and a semantic extension involving the OWL vocabulary JWeb Sem 3 (2005) 79ndash115

[68] Andreas Thor Erhard Rahm Moma ndash a mapping-based object matchingsystem in CIDR (2007) 247ndash258

[69] Giovanni Tummarello Richard Cyganiak Michele Catasta Szymon DanielczykStefan Decker Sigma live views on the Web of data in Semantic WebChallenge (ISWC2009) (2009)

[70] Jacopo Urbani Spyros Kotoulas Jason Maassen Frank van Harmelen Henri EBal OWL reasoning with WebPIE calculating the closure of 100 billion triplesin ESWC (1) 2010 pp 213ndash227

[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


(i) the vast majority of identifiers (about 99%) still only appear in one PLD; (ii) the difference between the baseline and extended reasoning approach is not so pronounced. The most widely referenced consolidated entity—in terms of unique PLDs—was "Evan Prodromou", as aforementioned, referenced with six equivalent URIs in 101 distinct PLDs.

In summary, we conclude that applying the consolidation rules directly (without more general reasoning) is currently a good approximation for Linked Data, and that, in comparison to the baseline consolidation over explicit owl:sameAs: (i) the additional consolidation rules generate a large bulk of intra-PLD equivalences for blank-nodes;27 (ii) relatedly, there is only a minor expansion (1.014×) in the number of URIs involved in the consolidation; (iii) with blacklisting, the overall precision remains roughly stable.

6. Statistical concurrence analysis

Having looked extensively at the subject of consolidating Linked Data entities, in this section we now introduce methods for deriving a weighted concurrence score between entities in the Linked Data corpus: we define entity concurrence as the sharing of outlinks, inlinks and attribute values, denoting a specific form of similarity. Conceptually, concurrence generates scores between two entities based on their shared inlinks, outlinks or attribute values. More "discriminating" shared characteristics lend a higher concurrence score: for example, two entities based in the same village are more concurrent than two entities based in the same country. How discriminating a certain characteristic is, is determined through statistical selectivity analysis. The more characteristics two entities share, (typically) the higher their concurrence score will be.

We use these concurrence measures to materialise new links between related entities, thus increasing the interconnectedness of the underlying corpus. We also leverage these concurrence measures in Section 7 for disambiguating entities, where, after identifying that an equivalence class is likely erroneous and causing inconsistency, we use concurrence scores to realign the most similar entities.

In fact, we initially investigated concurrence as a speculative means of increasing the recall of consolidation for Linked Data: the core premise of this investigation was that very similar entities—with higher concurrence scores—are likely to be coreferent. Along these lines, the methods described herein are based on preliminary works [34] where we:

– investigated domain-agnostic statistical methods for performing consolidation and identifying equivalent entities;

– formulated an initial small-scale (56 million triples) evaluation corpus for the statistical consolidation, using reasoning consolidation as a best-effort "gold standard".

Our evaluation gave mixed results, where we found some correlation between the reasoning consolidation and the statistical methods, but we also found that our methods gave incorrect results at high degrees of confidence for entities that were clearly not equivalent, but shared many links and attribute values in common. This highlights a crucial fallacy in our speculative approach: in almost all cases, even the highest degree of similarity/concurrence does not necessarily indicate equivalence or co-reference (Section 4.4; cf. [27]). Similar philosophical issues arise with respect to handling transitivity for the weighted "equivalences" derived [16,39].

27 We believe this to be due to FOAF publishing practices, whereby a given exporter uses consistent inverse-functional property values instead of URIs to uniquely identify entities across local documents.

However, deriving weighted concurrence measures has applications other than approximative consolidation: in particular, we can materialise named relationships between entities that share a lot in common, thus increasing the level of inter-linkage between entities in the corpus. Again, as we will see later, we can leverage the concurrence metrics to repair equivalence classes found to be erroneous during the disambiguation step of Section 7. Thus, we present a modified version of the statistical analysis presented in [34], describe a (novel) scalable and distributed implementation thereof, and finally evaluate the approach with respect to finding highly-concurring entities in our 1 billion triple Linked Data corpus.

6.1. High-level approach

Our statistical concurrence analysis inherits similar primary requirements to those imposed for consolidation: the approach should be scalable, fully automatic and domain agnostic to be applicable in our scenario. Similarly, with respect to secondary criteria, the approach should be efficient to compute, should give high precision, and should give high recall. Compared to consolidation, high precision is not as critical for our statistical use-case: in SWSE, we aim to use concurrence measures as a means of suggesting additional navigation steps for users browsing the entities—if the suggestion is uninteresting, it can be ignored, whereas incorrect consolidation will often lead to conspicuously garbled results, aggregating data on multiple disparate entities.

Thus, our requirements (particularly for scale) preclude the possibility of complex analyses or any form of pair-wise comparison, etc. Instead, we aim to design lightweight methods implementable by means of distributed sorts and scans over the corpus. Our methods are designed around the following intuitions and assumptions:

(i) the concurrence of entities is measured as a function of their shared pairs, be they predicate–subject (loosely, inlinks) or predicate–object pairs (loosely, outlinks or attribute values);

(ii) the concurrence measure should give a higher weight to exclusive shared-pairs—pairs that are typically shared by few entities, for edges (predicates) that typically have a low in-degree/out-degree;

(iii) with the possible exception of correlated pairs, each additional shared pair should increase the concurrence of the entities—a shared pair cannot reduce the measured concurrence of the sharing entities;

(iv) strongly exclusive property-pairs should be more influential than a large set of weakly exclusive pairs;

(v) correlation may exist between shared pairs—e.g., two entities may share an inlink and an inverse-outlink to the same node (e.g., foaf:depiction, foaf:depicts), or may share a large number of shared pairs for a given property (e.g., two entities co-authoring one paper are more likely to co-author subsequent papers)—where we wish to dampen the cumulative effect of correlation in the concurrence analysis;

(vi) the relative value of the concurrence measure is important; the absolute value is unimportant.

In fact, the concurrence analysis follows a similar principle to that for consolidation, where, instead of considering discrete functional and inverse-functional properties as given by the semantics of the data, we attempt to identify properties that are quasi-functional, quasi-inverse-functional, or what we more generally term exclusive: we determine the degree to which the values of properties (here abstracting directionality) are unique to an entity or set of entities. The concurrence between two entities then becomes an aggregation of the weights for the property–value pairs they share in common. Consider the following running example:


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between three colleagues ex:Alice, ex:Bob and ex:Claire; i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted Card_G(p,s), is the cardinality of the set {o ∈ C | (s,p,o) ∈ G}.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted ICard_G(p,o), is the cardinality of the set {s ∈ U ∪ B | (s,p,o) ∈ G}.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting G_ex denote our example graph, then, e.g., Card_Gex(foaf:maker, dblp:AliceB10) = 2. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph—e.g., ICard_Gex(foaf:gender, "female") = 2. Both directions are considered analogous for deriving concurrence scores—note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain-cardinality can be derived by switching subject and object (that is, switching directionality)—we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.
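To make Definitions 1 and 2 concrete, the following is a minimal sketch (not the paper's implementation) that computes observed cardinalities over the running example; it assumes the graph is held in memory as a list of (subject, predicate, object) tuples, with prefixed names such as ex:Alice standing in for full URIs:

# Running example graph as (subject, predicate, object) tuples.
G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", '"female"'),
    ("ex:Alice", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Bob", "foaf:gender", '"male"'),
    ("ex:Bob", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
]

def card(graph, p, s):
    """Card_G(p,s): number of distinct objects appearing with the (p,s) pair."""
    return len({o for (s_, p_, o) in graph if s_ == s and p_ == p})

def icard(graph, p, o):
    """ICard_G(p,o): number of distinct subjects appearing with the (p,o) pair."""
    return len({s_ for (s_, p_, o_) in graph if p_ == p and o_ == o})

print(card(G_ex, "foaf:maker", "dblp:AliceB10"))                          # 2
print(icard(G_ex, "foaf:gender", '"female"'))                             # 2
print(icard(G_ex, "foaf:workplaceHomepage", "<http://wonderland.com/>"))  # 3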

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written AIC_G(p), is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

AIC_G(p) = |{(s,o) | (s,p,o) ∈ G}| / |{o | ∃s : (s,p,o) ∈ G}|

The average cardinality AC_G(p) of a property p is defined analogously as for AIC_G(p). Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater-than or equal-to one: the numerator is by definition greater-than or equal-to the denominator. Taking an example, AIC_Gex(foaf:gender) = 1.5, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe AIC_G(p) ≈ 1 a quasi-inverse-functional property with respect to the graph G, and analogously properties for which we observe AC_G(p) ≈ 1 as quasi-functional properties. We see the values of such properties—in their respective directions—as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high AIC_G(p) since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower AIC_G(p) since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weigh the average (inverse-)cardinality values for properties with a low number of observations towards a global mean—consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula as follows to weight properties with few observations towards a global mean:

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written AAIC_G(p), is then

AAIC_G(p) = (AIC_G(p) · |G_p| + ĀIC_G · |Ḡ|) / (|G_p| + |Ḡ|)    (1)

where |G_p| is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), ĀIC_G is the average inverse-cardinality for all predicate–object pairs (formally, ĀIC_G = |G| / |{(p,o) | ∃s : (s,p,o) ∈ G}|), and |Ḡ| is the average number of distinct objects for all predicates in the graph (formally, |Ḡ| = |{(p,o) | ∃s : (s,p,o) ∈ G}| / |{p | ∃s,∃o : (s,p,o) ∈ G}|).

Again, the adjusted average cardinality AAC_G(p) of a property p is defined analogously as for AAIC_G(p). Some reduction of Eq. (1) is possible if one considers that AIC_G(p) · |G_p| = |{(s,o) | (s,p,o) ∈ G}| also denotes the number of triples for which p appears as a predicate in graph G, and that ĀIC_G · |Ḡ| = |G| / |{p | ∃s,∃o : (s,p,o) ∈ G}| also denotes the average number of triples per predicate. We maintain Eq. (1) in the given unreduced form as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (ĀIC_G) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (|Ḡ|).

[Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]).]

Now we move towards combining these metrics to determine

the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company—e.g., http://google.com/—should not gain as much concurrence as entities that share a value referential to a smaller company—e.g., http://deri.ie/. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value, and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.28

28 Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what's signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is:

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as:

C_G(p,s) = 1 / (Card_G(p,s) · AAC_G(p))

and the concurrence-coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as:

IC_G(p,o) = 1 / (ICard_G(p,o) · AAIC_G(p))

Again, please note that these coefficients fall into the interval ]0,1] since the denominator, by definition, is necessarily greater than one.

To take an example, let p_wh = foaf:workplaceHomepage and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com/ and assume that o_g occurs 2,000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2,000; now IC_G(p_wh, o_g) = 1/(2,000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie/ such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.
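Continuing the earlier sketch (and reusing its icard helper), the following is a hedged, illustrative rendering of AIC, the credibility-adjusted AAIC of Eq. (1), and the coefficient of Definition 5; the function names are ours, not the paper's:

def aic(graph, p):
    """AIC_G(p): triples with predicate p divided by its number of distinct objects."""
    triples_p = [(s, o) for (s, p_, o) in graph if p_ == p]
    return len(triples_p) / len({o for (_, o) in triples_p})

def aaic(graph, p):
    """Credibility-adjusted AIC (Eq. (1)): readings based on few observations are
    dampened towards the global mean inverse-cardinality."""
    po_pairs = {(p_, o) for (_, p_, o) in graph}
    predicates = {p_ for (_, p_, _) in graph}
    g_p = len({o for (_, p_, o) in graph if p_ == p})  # |G_p|: distinct objects of p
    mean_aic = len(graph) / len(po_pairs)              # global mean AIC
    mean_g_p = len(po_pairs) / len(predicates)         # average |G_p| over all predicates
    return (aic(graph, p) * g_p + mean_aic * mean_g_p) / (g_p + mean_g_p)

def ic_coefficient(graph, p, o):
    """Concurrence coefficient of a predicate-object pair (Definition 5)."""
    return 1.0 / (icard(graph, p, o) * aaic(graph, p))

# Coefficients of two pairs shared by ex:Alice and ex:Claire in the toy graph:
print(ic_coefficient(G_ex, "foaf:gender", '"female"'))
print(ic_coefficient(G_ex, "foaf:workplaceHomepage", "<http://wonderland.com/>"))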

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that, for each i = 1, ..., n, z_i ∈ ]0,1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1, ..., n, ACS_k = z_k + ACS_{k-1} − z_k · ACS_{k-1}.

The computation of the ACS value is the same process as determining the probability of two independent events occurring—P(A ∨ B) = P(A) + P(B) − P(A) · P(B)—which is by definition commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
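As a small illustration of Definition 6 (a sketch rather than the paper's code), the iterative probabilistic sum can be written as:

def acs(coefficients):
    """Aggregated concurrence score (Definition 6): the probabilistic sum of the
    coefficients; commutative, associative, and never exceeds 1."""
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

print(acs([0.2, 0.5]))  # 0.6: each extra shared pair increases the score ...
print(acs([0.5, 0.2]))  # 0.6: ... independently of the order of aggregation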

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples Z = {Z_p1, ..., Z_pn}, where each element Z_pi represents the tuple of (non-pruned) coefficients generated for the predicate p_i.29 We then aggregate this bag as follows:

ACS(Z) = ACS( ( ACS(Z_pi · AA[I]C(p_i)) / AA[I]C(p_i) )_{Z_pi ∈ Z} )

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper-bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory—if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s,p,o,c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o,p,s,c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s,p) pair, for each set of unique objects O_ps found thereon, and for each pair in

{(o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_ps, o_i < o_j}

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p,s), p, s, −)

where C(p,s) = 1 / (|O_ps| · AAC(p)). We apply the same process for the other direction: for each (p,o) pair, for each set of unique subjects S_po, and for each pair in

{(s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_po, s_i < s_j}

we output analogous sextuples of the form:

(s_i, s_j, IC(p,o), p, o, +)

We call the sets O_ps and their analogues S_po concurrence classes, denoting sets of entities that share the given predicate–subject/predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end so as to forego duplication of intermediary processing.

29 For brevity, we omit the graph subscript.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements—generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
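The following sketch illustrates the scan that emits the raw sextuples for one direction; it is an assumption-laden simplification (an in-memory iterator of quads already sorted by (s,p,o,c), and a hypothetical aac dictionary holding the adjusted average cardinality of each predicate), not the paper's own code:

from itertools import combinations, groupby

def raw_concurrence_tuples(quads_sorted_spoc, aac, max_class_size):
    """Single scan over quads sorted by (s, p, o, c): for each (s, p) pair, emit
    one tuple per pair of distinct shared objects, carrying the coefficient
    1 / (|O_ps| * AAC(p)); the '-' element records the directionality.
    An analogous scan over (o, p, s, c) order covers shared subjects."""
    for (s, p), group in groupby(quads_sorted_spoc, key=lambda q: (q[0], q[1])):
        objects = sorted({o for (_, _, o, _) in group})  # literals would be filtered here
        if 2 <= len(objects) <= max_class_size:
            coeff = 1.0 / (len(objects) * aac[p])
            for o_i, o_j in combinations(objects, 2):
                yield (o_i, o_j, coeff, p, s, "-")

The emitted files can then be sorted by their first two elements and merge-joined, aggregating the coefficients per (e_i, e_j) pair as in Definition 6.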

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets—for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate in the order of (10^6 × 10^6)/2 = 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set—e.g., |S_po|—is a factor in the denominator of the computed C(p,o) score, such that C(p,o) ≤ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38—we will see more in Section 6.4.
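One way the cut-off could be derived from the class-size histogram gathered during the statistics scan is sketched below (illustrative only; the histogram format and function name are our assumptions):

def choose_max_class_size(class_size_histogram, tuple_budget):
    """Pick the largest concurrence-class size such that the cumulative number of
    raw pair tuples stays within budget; a class of size n yields n*(n-1)/2
    unordered pairs of distinct elements."""
    total, cutoff = 0, 1
    for size in sorted(class_size_histogram):
        total += class_size_histogram[size] * size * (size - 1) // 2
        if total > tuple_budget:
            break
        cutoff = size
    return cutoff

# With a budget of one billion tuples, this plays the role of the cut-off of 38 used above.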

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function; the slaves simultaneously gather incoming sorted segments, and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object—triples with rdf:type as predicate are not included in the object-hashing; subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

Table 12. Breakdown of timing of distributed concurrence analysis (columns denote the number of slave machines over the 100 million quadruple sample, and Full-8 the full corpus over eight machines; each column gives minutes and the percentage of total execution time).

Category                                 1 machine       2 machines      4 machines      8 machines      Full-8
                                         min     %       min     %       min     %       min     %       min     %
Master
  Total execution time                   516.2  100.0    262.7  100.0    137.8  100.0     73.0  100.0    835.4  100.0
  Executing                                2.2    0.4      2.3    0.9      3.1    2.3      3.4    4.6      2.1    0.3
    Miscellaneous                          2.2    0.4      2.3    0.9      3.1    2.3      3.4    4.6      2.1    0.3
  Idle (waiting for slaves)              514.0   99.6    260.4   99.1    134.6   97.7     69.6   95.4    833.3   99.7
Slave
  Avg. executing (total exc. idle)       514.0   99.6    257.8   98.1    131.1   95.1     66.4   91.0    786.6   94.2
    Split/sort/scatter (subject)         119.6   23.2     59.3   22.6     29.9   21.7     14.9   20.4    175.9   21.1
    Merge-sort (subject)                  42.1    8.2     20.7    7.9     11.2    8.1      5.7    7.9     62.4    7.5
    Split/sort/scatter (object)          115.4   22.4     56.5   21.5     28.8   20.9     14.3   19.6    164.0   19.6
    Merge-sort (object)                   37.2    7.2     18.7    7.1      9.6    7.0      5.1    7.0     55.6    6.6
    Extract high-level statistics         20.8    4.0     10.1    3.8      5.1    3.7      2.7    3.7     28.4    3.3
    Generate raw concurrence tuples       45.4    8.8     23.3    8.9     11.6    8.4      5.9    8.0     67.7    8.1
    Coordinate/sort concurrence tuples    97.6   18.9     50.1   19.1     25.0   18.1     12.5   17.2    166.0   19.9
    Merge-sort/aggregate similarity       36.0    7.0     19.2    7.3     10.0    7.2      5.3    7.2     66.6    8.0
  Avg. idle                                2.2    0.4      4.9    1.9      6.7    4.9      6.5    9.0     48.8    5.8
    Waiting for peers                      0.0    0.0      2.6    1.0      3.6    2.6      3.2    4.3     46.7    5.6
    Waiting for master                     2.2    0.4      2.3    0.9      3.1    2.3      3.4    4.6      2.1    0.3


(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Here we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step—in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to uneven distribution of data, and subsequently uneven load balancing—e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970—even for classes with two members, the potential effect on concurrence is negligible.
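A minimal sketch of the kind of modulo-hash placement described above (function names and the use of Python's built-in hash are our assumptions, not the paper's implementation):

def placement(term, num_slaves):
    """All quads hashed on the same term go to the same slave, so a subsequent
    local merge-join sees all of the data for that term."""
    return hash(term) % num_slaves

def split_by_object(quads, num_slaves):
    """Split quads by a modulo-hash on the object, leaving rdf:type triples aside
    to avoid the skew that very large classes (e.g., foaf:Person) would cause."""
    segments = [[] for _ in range(num_slaves)]
    for (s, p, o, c) in quads:
        if p != "rdf:type":
            segments[placement(o, num_slaves)].append((s, p, o, c))
    return segments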

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% waiting for the slave machines to finish their tasks—again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static—i.e., larger class sizes give

[Fig. 9. Breakdown of potential concurrence classes given with respect to (a) predicate–object pairs and (b) predicate–subject pairs, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i · (s_i² + s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at ~1 billion (log/log).]

quadratically more concurrences, but occur polynomially less often—until the point where the largest classes, that generally only have one occurrence, are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000) we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996)—the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999)—almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs and 303 million through (p,s) pairs).

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

Table 13. Top five predicates with respect to lowest adjusted average cardinality (AAC).

  #  Predicate                  Objects        Triples        AAC
  1  foaf:nick                  150,433,035    150,437,864    1.000
  2  lldpubmed:journal            6,790,426      6,795,285    1.003
  3  rdf:value                    2,802,998      2,803,021    1.005
  4  eurostat:geo                 2,642,034      2,642,034    1.005
  5  eurostat:time                2,641,532      2,641,532    1.005

Table 14. Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  #  Predicate                    Subjects     Triples      AAIC
  1  lldpubmed:meshHeading        2,121,384    2,121,384    1.009
  2  opiumfield:recommendation    1,920,992    1,920,992    1.010
  3  fb:type.object.key           1,108,187    1,108,187    1.017
  4  foaf:page                    1,702,704    1,712,970    1.017
  5  skipinions:hasFeature        1,010,604    1,010,604    1.019

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs—thus, although inter-PLD concurrences are fewer, the concurrences found within a single PLD are typically much stronger.32

Table 15. Top five concurrent entities and the number of pairs they share.

  #  Entity label 1    Entity label 2    Concur.
  1  New York City     New York State    791
  2  London            England           894
  3  Tokyo             Japan             900
  4  Toronto           Ontario           418
  5  Philadelphia      Pennsylvania      217

Table 16. Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively: entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

  #  Ranked entity        Con.    "Closest" entity     Val.
  1  Tim Berners-Lee       908    Lalana Kagal         0.83
  2  Dan Brickley         2552    Libby Miller         0.94
  3  update.status.net      11    socialnetwork.ro     0.45
  4  FOAF-a-matic           21    foaf.me              0.23
  5  Evan Prodromou       3367    Stav Prodromou       0.89



In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share—the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters—properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities

found for the top five ranked entities—for brevity, again we show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7. Entity disambiguation and repair

We have already seen that—even by only exploiting the formal logical consequences of the data through reasoning—consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading: for the purposes of illustration, we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence

37 http://ontologydesignpatterns.org/cp/owl/fsdas/

RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class—breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .    (terminological)
?x ?p ?l1 , ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where we mark the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length—a functional property—and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:


Terminological [http://xmlns.com/foaf/spec/index.rdf]
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
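A small, hedged sketch of the relaxed literal comparison described above (the helper names and the use of a raw term-inequality check as a stand-in for full datatype-aware value comparison are our assumptions):

import re

def relaxed_form(literal):
    """Strip an N-Triples-style language tag or datatype and lower-case the
    remaining lexical form (hypothetical helper)."""
    m = re.match(r'^"(.*)"(?:@[\w-]+|\^\^\S+)?$', literal)
    return (m.group(1) if m else literal).lower()

def incompatible_literals(l1, l2):
    """Flag a functional-property inconsistency only if the two literals differ
    both as raw terms and under the relaxed comparison."""
    return l1 != l2 and relaxed_form(l1) != relaxed_form(l2)

print(incompatible_literals('"male"', '"Male"@en'))                        # False: not flagged
print(incompatible_literals('"3.546"^^xsd:double', '"2.97"^^xsd:double'))  # True: flagged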

With respect to these consistency-checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

# Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

# Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

# Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

# Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

# Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

# Terminological [fictional]
made owl:propertyDisjointWith maker .

# Assertional [fictional]
ex:AZ′ maker ex:entcons′ .
dblp:Antoine_Zimmermann′ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:

(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject (loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics); c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U ∪ B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward, greedy, consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now, let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that there exists no {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs, such that neither Ei nor Ej have already been merged in this iteration, producing Eqn+1 at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
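The following is a minimal, self-contained sketch of this greedy clustering; the data representation, the weight function and all names are our own illustrative assumptions, not the exact structures used in the system:

from itertools import combinations

def repair(identifiers, incompatible, weight):
    """
    identifiers : identifiers in the inconsistent equivalence class
    incompatible: set of frozenset({a, b}) pairs that must end up separated
    weight      : function (Ei, Ej) -> (d, c); pairs compare lexicographically
    """
    partitions = [{i} for i in identifiers]          # seed singleton partitions
    changed = True
    while changed:                                   # iterate until fixpoint
        changed = False
        # candidate merges that introduce no incompatible pair
        candidates = []
        for Ei, Ej in combinations(partitions, 2):
            if not any(frozenset((a, b)) in incompatible for a in Ei for b in Ej):
                candidates.append((weight(Ei, Ej), Ei, Ej))
        candidates.sort(key=lambda t: t[0], reverse=True)   # strongest links first
        merged = []                                  # partitions already merged this round
        for _, Ei, Ej in candidates:
            if any(E is Ei or E is Ej for E in merged):
                continue                             # each partition merges at most once per round
            partitions.remove(Ei); partitions.remove(Ej)
            partitions.append(Ei | Ej)
            merged += [Ei, Ej]
            changed = True
    return partitions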

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all of the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s-o over the sorted stream of data, and batch consolidated segments of data.
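As an illustrative sketch (assuming simple in-memory iterators of sextuples rather than the actual sorted on-disk files), the merge-join that emits one consolidated segment per canonical identifier could look as follows:

from itertools import groupby

def merge_join(by_subject, by_object):
    """by_subject: sextuples (s, p, o, c, s_orig, o_orig), sorted by canonical s
       by_object : sextuples (o, p, s, c, o_orig, s_orig), sorted by canonical o
       Yields one batch per canonical identifier: (id, outlinks, inlinks)."""
    subj_groups = groupby(by_subject, key=lambda t: t[0])
    obj_groups = groupby(by_object, key=lambda t: t[0])
    s_key, s_grp = next(subj_groups, (None, None))
    o_key, o_grp = next(obj_groups, (None, None))
    while s_key is not None or o_key is not None:
        # smallest canonical identifier currently at the head of either stream
        key = min(k for k in (s_key, o_key) if k is not None)
        outlinks = list(s_grp) if s_key == key else []
        inlinks = list(o_grp) if o_key == key else []
        yield key, outlinks, inlinks
        if s_key == key:
            s_key, s_grp = next(subj_groups, (None, None))
        if o_key == key:
            o_key, o_grp = next(obj_groups, (None, None))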

On a micro level, we buffer each individual consolidated segment into an in-memory index. Currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and reinfer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
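A small sketch of this bookkeeping, under the assumption that input triples are represented by hashable tuples (the class and method names are illustrative, not the system's API):

class EquivalenceGraph:
    """Undirected evidence graph: edge {a, b} -> set of hashes of the input
    triples from which the equivalence a ~ b can be (re)derived."""

    def __init__(self):
        self.edges = {}                      # frozenset({a, b}) -> set of triple hashes

    def add_evidence(self, a, b, triple):
        self.edges.setdefault(frozenset((a, b)), set()).add(hash(triple))

    def weight(self, a, b):
        # primary weight d: number of independent evidences for a ~ b
        return len(self.edges.get(frozenset((a, b)), set()))

    def merge(self, a, b, into):
        """Merge nodes a and b into `into`: unique edges are kept, and the
        hash-sets of overlapping edges are unioned."""
        merged = {}
        for edge, hashes in self.edges.items():
            edge = frozenset(into if n in (a, b) else n for n in edge)
            if len(edge) < 2:
                continue                     # drop self-loops created by the merge
            merged.setdefault(edge, set()).update(hashes)
        self.edges = merged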

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair (columns: number of slave machines; Full-8: full corpus on eight slave machines).

                                      1             2             4             8           Full-8
Category                           min     %     min     %     min     %     min     %     min     %
Master
  Total execution time           114.7 100.0    61.3 100.0    36.7 100.0    23.8 100.0   141.1 100.0
  Executing: miscellaneous           -     -       -     -     0.1   0.1     0.1   0.1     0.0     -
  Idle (waiting for slaves)      114.7 100.0    61.2 100.0    36.6  99.9    23.8  99.9   141.1 100.0
Slave (avg.)
  Executing (total exc. idle)    114.7 100.0    59.0  96.3    33.6  91.5    22.3  93.7   130.9  92.8
    Identify inconsistencies
      and repairs                 74.7  65.1    38.5  62.8    23.2  63.4    17.2  72.2    72.4  51.3
    Repair corpus                 40.0  34.8    20.5  33.5    10.3  28.1     5.1  21.5    58.5  41.5
  Idle                             0.0   0.0     2.3   3.7     3.1   8.5     1.5   6.3    10.2   7.2
    Waiting for peers              0.0   0.0     2.2   3.7     3.1   8.4     1.5   6.2    10.1   7.2
    Waiting for master               -     -       -     -       -     -       -     -     0.1   0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus (a schematic sketch of these two steps follows).
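Schematically, the coordination pattern is as follows; the master/slave objects and their methods are hypothetical placeholders for the system's own distributed primitives:

def distributed_repair(master, slaves):
    # run: each slave diagnoses its own sorted segment and derives local repairs
    local_repairs = [slave.diagnose_and_repair_segment() for slave in slaves]

    # gather: the master merges the repair information from all slaves...
    merged = master.merge_repairs(local_repairs)

    # ...and floods the merged repairs back; each slave then rewrites its
    # segment of the corpus according to the repaired canonical identifiers
    for slave in slaves:
        slave.apply_repairs(merged)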

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
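These factors can be checked directly against the total execution times reported in Table 17:

\[
\frac{61.3}{114.7} \approx 0.534, \qquad \frac{36.7}{61.3} \approx 0.599, \qquad \frac{23.8}{36.7} \approx 0.649.
\]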

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual      Auto        Count    %
SAME        *           223     44.3
DIFFERENT   *           261     51.9
UNCLEAR     -            19      3.8
*           SAME        361     74.6
*           DIFFERENT   123     25.4
SAME        SAME        205     91.9
SAME        DIFFERENT    18      8.1
DIFFERENT   DIFFERENT   105     40.2
DIFFERENT   SAME        156     59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

#   Functional property                  Detections
1   foaf:gender                          60
2   foaf:age                             32
3   dbo:height, dbo:length                4
4   dbo:wheelbase, dbo:width              3
5   atomowl:body                          1
6   loc:address, loc:name, loc:phone      1

Table 19
Breakdown of inconsistency detections for disjoint classes.

#   Disjoint class 1       Disjoint class 2    Detections
1   foaf:Document          foaf:Person         163
2   foaf:Document          foaf:Agent          153
3   foaf:Organization      foaf:Person          17
4   foaf:Person            foaf:Project          3
5   foaf:Organization      foaf:Project          1

[Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.]

[Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.]

44 We were particularly careful to distinguish information resources, such as Wikipedia articles, and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that, between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class; 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition; each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
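As a worked check against Table 20 (taking, for the first set of scores, the pairs kept equivalent by the repair as predicted SAME, and the manual annotation as ground truth):

\[
P_{\mathrm{same}} = \frac{205}{361} \approx 0.568, \quad R_{\mathrm{same}} = \frac{205}{223} \approx 0.919, \quad F_1 = \frac{2PR}{P+R} \approx 0.702;
\]
\[
P_{\mathrm{diff}} = \frac{105}{123} \approx 0.854, \quad R_{\mathrm{diff}} = \frac{105}{261} \approx 0.402, \quad F_1 \approx 0.546.
\]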

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck: for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.


Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such consideration, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third party data about an entity is somewhat fallacious without also considering the third party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology, particularly functional properties, inverse-functional properties and disjointness constraints, which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers, and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the aligning step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis: they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data, and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset, or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage the use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes, e.g., to notify another agent or correct broken links, and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix         URI
"T-Box" prefixes
atomowl        http://bblfish.net/work/atom-owl/2006-06-06/
b2r            http://bio2rdf.org/bio2rdf
b2rr           http://bio2rdf.org/bio2rdf_resource
dbo            http://dbpedia.org/ontology/
dbp            http://dbpedia.org/property/
ecs            http://rdf.ecs.soton.ac.uk/ontology/
eurostat       http://ontologycentral.com/2009/01/eurostat/ns#
fb             http://rdf.freebase.com/ns/
foaf           http://xmlns.com/foaf/0.1/
geonames       http://www.geonames.org/ontology#
kwa            http://knowledgeweb.semanticweb.org/heterogeneity/alignment
lldpubmed      http://linkedlifedata.com/resource/pubmed/
loc            http://sw.deri.org/2006/07/location/loc#
mo             http://purl.org/ontology/mo/
mvcb           http://webns.net/mvcb/
opiumfield     http://rdf.opiumfield.com/lastfm/spec#
owl            http://www.w3.org/2002/07/owl#
quaffing       http://purl.org/net/schemas/quaffing/
rdf            http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs           http://www.w3.org/2000/01/rdf-schema#
skipinions     http://skipforward.net/skipforward/page/seeder/skipinions/
skos           http://www.w3.org/2004/02/skos/core#
"A-Box" prefixes
avtimbl        http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia        http://dbpedia.org/resource/
dblpperson     http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p      http://www.eswc2006.org/people/
identicau      http://identi.ca/user/
kingdoms       http://lod.geospecies.org/kingdoms/
macs           http://stitch.cs.vu.nl/alignments/macs/
semweborg      http://www.data.semanticweb.org/organization/
swid           http://semanticweb.org/id/
timblfoaf      http://www.w3.org/People/Berners-Lee/card#
vperson        http://virtuoso.openlinksw.com/dataspace/person/
wikier         http://www.wikier.org/foaf.rdf#
yagor          http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, Factforge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).




[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on The Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M Haas Martin Hentschel Donald Kossmann Reneacutee J Miller Schemaand data a holistic approach to mapping resolution and fusion in informationintegration ER (2009) 27ndash40

[26] Harry Halpin Patrick J Hayes James P McCusker Deborah L McGuinnessHenry S Thompson When owlsameAs Isnrsquot the Same An Analysis of Identityin Linked Data in International Semantic Web Conference (1) (2010) 305ndash320

[27] Harry Halpin Ivan Herman Pat Hayes When owlsameAs isnrsquot the Same AnAnalysis of Identity Links on the Semantic Web in Linked Data on the WebWWW2010 Workshop (LDOW2010) (2010)

[28] Patrick Hayes RDF Semantics W3C Recommendation (2004) httpwwww3orgTRrdf-mt

[29] Aidan Hogan Exploiting RDFS and OWL for Integrating Heterogeneous Large-Scale Linked Data Corpora PhD Thesis Digital Enterprise Research InstituteNational University of Ireland Galway 2011 Available from lthttpaidanhogancomdocsthesisgt

[30] Aidan Hogan Andreas Harth Stefan Decker Performing Object Consolidationon the Semantic Web Data Graph in First I3 Workshop Identity IdentifiersIdentification Workshop 2007

[31] Aidan Hogan Andreas Harth Axel Polleres Scalable Authoritative OWLReasoning for the Web Int J Semantic Web Inf Syst 5 (2) (2009) 49ndash90

[32] Aidan Hogan Andreas Harth Juumlrgen Umbrich Sheila Kinsella Axel PolleresStefan Decker Searching and browsing linked data with SWSE the SemanticWeb search engine J Web Sem 9 (4) (2011) 365ndash401

[33] Aidan Hogan Jeff Z Pan Axel Polleres Stefan Decker SAOR template ruleoptimisations for distributed reasoning over 1 billion linked data triples inInternational Semantic Web Conference 2010

[34] Aidan Hogan Axel Polleres Juumlrgen Umbrich Antoine Zimmermann Someentities are more equal than others statistical methods to consolidate LinkedData in Fourth International Workshop on New Forms of Reasoning for theSemantic Web Scalable and Dynamic (NeFoRS2010) 2010

[35] Wei Hu Jianfeng Chen Yuzhong Qu A self-training approach for resolvingobject coreference on the Semantic Web WWW (2011) 87ndash96

[36] Wei Hu Yuzhong Qu Xingzhi Sun Bootstrapping object coreferencing on theSemantic Web J Comput Sci Technol 26 (4) (2011) 663ndash675

[37] Ekaterini Ioannou Odysseas Papapetrou Dimitrios Skoutas Wolfgang NejdlEfficient semantic-aware detection of near duplicate resources in ESWC (2)2010 pp 136ndash150

[38] Anja Jentzsch Jun Zhao Oktie Hassanzadeh Kei-Hoi Cheung MatthiasSamwald Bo Andersson Linking Open Drug Data in InternationalConference on Semantic Systems (I-SEMANTICSrsquo09) 2009

[39] Frank Klawonn Should fuzzy equality and similarity satisfytransitivityComments on the paper by M De Cock and E Kerre Fuzzy SetsSyst 133 (2) (2003) 175ndash180

[40] Vladimir Kolovski Zhe Wu George Eadon Optimizing enterprise-scale OWL 2RL reasoning in a relational database system in International Semantic WebConference (2010)

[41] Juanzi Li Jie Tang Yi Li Qiong Luo Rimom A dynamic multistrategy ontologyalignment framework IEEE Trans Knowl Data Eng 21 (8) (2009) 1218ndash1232

[42] Deborah L McGuinness Frank van Harmelen OWL Web Ontology LanguageOverview W3C Recommendation (2004) Available from lthttpwwww3orgTRowl-featuresgt

[43] Georgios Meditskos Nick Bassiliades DLEJena A practical forward-chainingOWL 2 RL reasoner combining Jena and Pellet J Web Sem 8 (1) (2010) 89ndash94

[44] Martin Michalowski Snehal Thakkar Craig A Knoblock Exploiting secondarysources for automatic object consolidation in Proceeding of 2003 KDDWorkshop on Data Cleaning Record Linkage and Object Consolidation (2003)

[45] Alistair Miles Thomas Baker Ralph Swick Best Practice Recipes for PublishingRDF Vocabularies March 2006 Available from lthttpwwww3orgTR2006WD-swbp-vocab-pub-20060314lgt Superceded by Berrueta and Phippslthttpwwww3orgTRswbp-vocab-pubgt

[46] Fergal Monaghan David OrsquoSullivan Leveraging ontologies context and socialnetworks to automate photo annotation in SAMT (2007) 252ndash255

[47] HB Newcombe JM Kennedy SJ Axford AP James Automatic Linkage ofVital Records Computers can be used to extract lsquolsquofollow-uprsquorsquo statistics offamilies from files of routine records Science 130 (1959) 954ndash959

[48] Andriy Nikolov Victoria S Uren Enrico Motta Anne N De Roeck Integration ofSemantically Annotated Data by the KnoFuss Architecture in EKAW 2008 pp265ndash274

[49] Jan Noessner Mathias Niepert Christian Meilicke Heiner StuckenschmidtLeveraging terminological structure for object reconciliation in ESWC (2)2010 pp 334ndash348

[50] Eyal Oren Renaud Delbru Michele Catasta Richard Cyganiak HolgerStenzhorn Giovanni Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J Metadata Semant Ontol 3 (1) (2008) 37ndash52

[51] Eyal Oren Spyros Kotoulas George Anadiotis Ronny Siebes Annette ten TeijeFrank van Harmelen Marvin distributed reasoning over large-scale SemanticWeb data J Web Sem 7 (4) (2009) 305ndash316

[52] Niko Popitsch Bernhard Haslhofer DSNotify handling broken links in the webof data in WWW (2010) 761ndash770

[53] Eric Prudrsquohommeaux Andy Seaborne SPARQL Query Language for RDF W3CRecommendation (2008) httpwwww3orgTRrdf-sparql-query

[54] Yves Raimond Christopher Sutton Mark B Sandler Interlinking Music-Related Data on the Web IEEE MultiMedia 16 (2) (2009) 52ndash63

[55] Raymond Reiter A Theory of Diagnosis from First Principles Artif Intell 32 (1)(1987) 57ndash95

[56] Fatiha Saıs Nathalie Pernelle Marie-Christine Rousset Combining a logicaland a numerical method for data reconciliation J Data Sem 12 (2009)66ndash94

[57] Manuel Salvadores Gianluca Correndo Bene Rodriguez-Castro NicholasGibbins John Darlington Nigel R Shadbolt LinksB2N automatic dataintegration for the Semantic Web in OTM Conferences (2) 2009 pp 1121ndash1138

[58] F Scharffe Y Liu C Zhou RDF-AI an architecture for RDF datasets matchingfusion and interlink in IJCAI 2009 Workshop on Identity Reference andKnowledge Representation (IR-KR) (2009)

[59] Lian Shi Diego Berrueta Sergio Fernaacutendez Luis Polo Silvino FernaacutendezSmushing RDF instances are Alice and Bob the same open source developerin PICKME Workshop (2008)

[60] Sameer Singh Michael L Wick Andrew McCallum Distantly labeling data forlarge scale cross-document coreference CoRR (2010) abs10054298

[61] Jennifer Sleeman Tim Finin Learning co-reference relations for FOAFinstances in Poster and Demo Session at ISWC (2010)

[62] Dezhao Song Jeff Heflin Automatically generating data linkages using adomain-independent candidate selection approach in International SemanticWeb Conference (2011)

[63] Mechthild Stoer Frank Wagner A simple min-cut algorithm J ACM 44 (4)(1997) 585ndash591

[64] Michael Stonebraker The Case for Shared Nothing IEEE Database Eng Bull 9(1) (1986) 4ndash9

[65] Heiner Stuckenschmidt A semantic similarity measure for ontology-basedinformation in FQAS rsquo09 Proceedings of the Eighth International Conferenceon Flexible Query Answering Systems Springer-Verlag Berlin Heidelberg2009 pp 406ndash417

[66] Robert Endre Tarjan A class of algorithms which require nonlinear time tomaintain disjoint sets J Comput Syst Sci 18 (2) (1979) 110ndash127

[67] Herman J ter Horst Completeness decidability and complexity of entailmentfor RDF Schema and a semantic extension involving the OWL vocabulary JWeb Sem 3 (2005) 79ndash115

[68] Andreas Thor Erhard Rahm Moma ndash a mapping-based object matchingsystem in CIDR (2007) 247ndash258

[69] Giovanni Tummarello Richard Cyganiak Michele Catasta Szymon DanielczykStefan Decker Sigma live views on the Web of data in Semantic WebChallenge (ISWC2009) (2009)

[70] Jacopo Urbani Spyros Kotoulas Jason Maassen Frank van Harmelen Henri EBal OWL reasoning with WebPIE calculating the closure of 100 billion triplesin ESWC (1) 2010 pp 213ndash227

[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649

110 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

[72] Julius Volz Christian Bizer Martin Gaedke Georgi Kobilarov Discovering andmaintaining links on the Web of data in International Semantic WebConference (2009) 650ndash665

[73] Denny Vrandeciacutec Markus Kroumltzsch Sebastian Rudolph Uta Loumlsch Leveragingnon-lexical knowledge for the linked open data Web Review of April Foolrsquosday Transactions (RAFT) 5 (2010) 18ndash27

[74] Jesse Weaver James A Hendler Parallel materialization of the finite RDFSclosure for hundreds of millions of triples in International Semantic WebConference (ISWC2009) (2009) 682ndash697

[75] Lotfi A Zadeh George J Klir Bo Yuan Fuzzy Sets Fuzzy Logic Fuzzy SystemsWorld Scientific Press 1996


dblp:AliceB10 foaf:maker ex:Alice .
dblp:AliceB10 foaf:maker ex:Bob .
ex:Alice foaf:gender "female" .
ex:Alice foaf:workplaceHomepage <http://wonderland.com/> .
ex:Bob foaf:gender "male" .
ex:Bob foaf:workplaceHomepage <http://wonderland.com/> .
ex:Claire foaf:gender "female" .
ex:Claire foaf:workplaceHomepage <http://wonderland.com/> .

where we want to determine the level of (relative) concurrence between the three colleagues ex:Alice, ex:Bob and ex:Claire: i.e., how much do they coincide/concur with respect to exclusive shared pairs?

6.1.1. Quantifying concurrence

First, we want to characterise the uniqueness of properties; thus, we analyse their observed cardinality and inverse-cardinality as found in the corpus (in contrast to their defined cardinality as possibly given by the formal semantics).

Definition 1 (Observed cardinality). Let G be an RDF graph, p be a property used as a predicate in G, and s be a subject in G. The observed cardinality (or henceforth in this section, simply cardinality) of p with respect to s in G, denoted $\mathrm{Card}_G(p,s)$, is the cardinality of the set $\{o \in \mathbf{C} \mid (s,p,o) \in G\}$.

Definition 2 (Observed inverse-cardinality). Let G and p be as before, and let o be an object in G. The observed inverse-cardinality (or henceforth in this section, simply inverse-cardinality) of p with respect to o in G, denoted $\mathrm{ICard}_G(p,o)$, is the cardinality of the set $\{s \in \mathbf{U} \cup \mathbf{B} \mid (s,p,o) \in G\}$.

Thus, loosely, the observed cardinality of a property–subject pair is the number of unique objects it appears with in the graph (or unique triples it appears in); letting $G_{ex}$ denote our example graph, then, e.g., $\mathrm{Card}_{G_{ex}}(\text{foaf:maker}, \text{dblp:AliceB10}) = 2$. We see this value as a good indicator of how exclusive (or selective) a given property–subject pair is, where sets of entities appearing in the object position of low-cardinality pairs are considered to concur more than those appearing with high-cardinality pairs. The observed inverse-cardinality of a property–object pair is the number of unique subjects it appears with in the graph; e.g., $\mathrm{ICard}_{G_{ex}}(\text{foaf:gender}, \text{"female"}) = 2$. Both directions are considered analogous for deriving concurrence scores; note, however, that we do not consider concurrence for literals (i.e., we do not derive concurrence for literals that share a given predicate–subject pair; we do, of course, consider concurrence for subjects with the same literal value for a given predicate).
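To make Definitions 1 and 2 concrete, the following is a minimal Python sketch (purely illustrative, not the implementation used in this paper) that computes both measures over the example graph above, with RDF terms abbreviated as plain strings:

```python
# Example graph G_ex as (subject, predicate, object) triples
G_ex = [
    ("dblp:AliceB10", "foaf:maker", "ex:Alice"),
    ("dblp:AliceB10", "foaf:maker", "ex:Bob"),
    ("ex:Alice", "foaf:gender", '"female"'),
    ("ex:Alice", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Bob", "foaf:gender", '"male"'),
    ("ex:Bob", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
    ("ex:Claire", "foaf:gender", '"female"'),
    ("ex:Claire", "foaf:workplaceHomepage", "<http://wonderland.com/>"),
]

def card(graph, p, s):
    """Observed cardinality Card_G(p, s): number of distinct objects for (s, p)."""
    return len({o for (s_, p_, o) in graph if s_ == s and p_ == p})

def icard(graph, p, o):
    """Observed inverse-cardinality ICard_G(p, o): number of distinct subjects for (p, o)."""
    return len({s_ for (s_, p_, o_) in graph if p_ == p and o_ == o})

print(card(G_ex, "foaf:maker", "dblp:AliceB10"))   # 2
print(icard(G_ex, "foaf:gender", '"female"'))      # 2
```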

To avoid unnecessary duplication, we henceforth focus on describing only the inverse-cardinality statistics of a property, where the analogous metrics for plain cardinality can be derived by switching subject and object (that is, switching directionality); we choose the inverse direction as perhaps being more intuitive, indicating concurrence of entities in the subject position based on the predicate–object pairs they share.

Definition 3 (Average inverse-cardinality). Let G be an RDF graph and p be a property used as a predicate in G. The average inverse-cardinality (AIC) of p with respect to G, written $\mathrm{AIC}_G(p)$, is the average of the non-zero inverse-cardinalities of p in the graph G. Formally:

$$\mathrm{AIC}_G(p) = \frac{|\{(s,o) \mid (s,p,o) \in G\}|}{|\{o \mid \exists s : (s,p,o) \in G\}|}$$

The average cardinality $\mathrm{AC}_G(p)$ of a property p is defined analogously as for $\mathrm{AIC}_G(p)$. Note that the (inverse-)cardinality value of any term appearing as a predicate in the graph is necessarily greater than or equal to one: the numerator is by definition greater than or equal to the denominator. Taking an example, $\mathrm{AIC}_{G_{ex}}(\text{foaf:gender}) = 1.5$, which can be viewed equivalently as the average of the non-zero inverse-cardinalities of foaf:gender (1 for "male" and 2 for "female"), or the number of triples with predicate foaf:gender divided by the number of unique values appearing in the object position of such triples (3/2).

We call a property p for which we observe $\mathrm{AIC}_G(p) \approx 1$ a quasi-inverse-functional property with respect to the graph G, and, analogously, properties for which we observe $\mathrm{AC}_G(p) \approx 1$ as quasi-functional properties. We see the values of such properties, in their respective directions, as being very exceptional: very rarely shared by entities. Thus, we would expect a property such as foaf:gender to have a high $\mathrm{AIC}_G(p)$, since there are only two object-values ("male", "female") shared by a large number of entities, whereas we would expect a property such as foaf:workplaceHomepage to have a lower $\mathrm{AIC}_G(p)$, since there are arbitrarily many values to be shared amongst the entities; given such an observation, we then surmise that a shared foaf:gender value represents a weaker "indicator" of concurrence than a shared value for foaf:workplaceHomepage.

Given that we deal with incomplete information under the Open World Assumption underlying RDF(S)/OWL, we also wish to weight the average (inverse-)cardinality values for properties with a low number of observations towards a global mean. Consider a fictional property ex:maritalStatus for which we only encounter a few predicate-usages in a given graph, and consider two entities given the value "married": given sparse inverse-cardinality observations, we may naïvely over-estimate the significance of this property–object pair as an indicator for concurrence. Thus, we use a credibility formula, as follows, to weight properties with few observations towards a global mean.

Definition 4 (Adjusted average inverse-cardinality). Let p be a property appearing as a predicate in the graph G. The adjusted average inverse-cardinality (AAIC) of p with respect to G, written $\mathrm{AAIC}_G(p)$, is then:

$$\mathrm{AAIC}_G(p) = \frac{\mathrm{AIC}_G(p) \cdot |G_p| + \overline{\mathrm{AIC}_G} \cdot \overline{|G_p|}}{|G_p| + \overline{|G_p|}} \qquad (1)$$

where $|G_p|$ is the number of distinct objects that appear in a triple with p as a predicate (the denominator of Definition 3), $\overline{\mathrm{AIC}_G}$ is the average inverse-cardinality for all predicate–object pairs (formally, $\overline{\mathrm{AIC}_G} = \frac{|G|}{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}$), and $\overline{|G_p|}$ is the average number of distinct objects for all predicates in the graph (formally, $\overline{|G_p|} = \frac{|\{(p,o) \mid \exists s : (s,p,o) \in G\}|}{|\{p \mid \exists s, \exists o : (s,p,o) \in G\}|}$).
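As a companion to Eq. (1), here is a short Python sketch of AIC and AAIC (again illustrative only; it reuses the toy graph G_ex from the earlier sketch and assumes the triple list contains no duplicates):

```python
def aic(graph, p):
    """Average inverse-cardinality AIC_G(p): triples with predicate p
    divided by distinct objects of p (Definition 3)."""
    pairs = {(s, o) for (s, p_, o) in graph if p_ == p}
    objects = {o for (_, o) in pairs}
    return len(pairs) / len(objects)

def aaic(graph, p):
    """Adjusted average inverse-cardinality AAIC_G(p), Eq. (1): the raw
    AIC_G(p) reading is dampened towards the global mean by the size of
    its sample relative to the average sample size."""
    po_pairs = {(p_, o) for (_, p_, o) in graph}
    predicates = {p_ for (p_, _) in po_pairs}
    mean_aic = len(graph) / len(po_pairs)        # average inverse-cardinality over all (p, o) pairs
    mean_sample = len(po_pairs) / len(predicates)  # average number of distinct objects per predicate
    sample = len({o for (_, p_, o) in graph if p_ == p})  # |G_p|: distinct objects of p
    return (aic(graph, p) * sample + mean_aic * mean_sample) / (sample + mean_sample)

print(aic(G_ex, "foaf:gender"))    # 1.5: three triples over two distinct values
print(aaic(G_ex, "foaf:gender"))   # dampened towards the graph-wide mean
```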

Again, the adjusted average cardinality $\mathrm{AAC}_G(p)$ of a property p is defined analogously as for $\mathrm{AAIC}_G(p)$. Some reduction of Eq. (1) is possible if one considers that $\mathrm{AIC}_G(p) \cdot |G_p| = |\{(s,o) \mid (s,p,o) \in G\}|$ also denotes the number of triples for which p appears as a predicate in graph G, and that $\overline{\mathrm{AIC}_G} \cdot \overline{|G_p|} = \frac{|G|}{|\{p \mid \exists s, \exists o : (s,p,o) \in G\}|}$ also denotes the average number of triples per predicate. We maintain Eq. (1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading ($\mathrm{AIC}_G(p)$) is dampened towards a mean ($\overline{\mathrm{AIC}_G}$) by a factor determined by the size of the sample used to derive the reading ($|G_p|$) relative to the average sample size ($\overline{|G_p|}$).

[Fig. 8: Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward-edges (with respect to the principal entities) are italicised and underlined (from [34]). The recovered node labels include swperson:stefan-decker, swperson:andreas-harth, swpaper:140, swpaper:2221, swpaper:3403, dbpedia:Ireland, sworg:nui-galway and sworg:deri-nui-galway; the recovered edge labels include foaf:made, dc:creator, swrc:author, foaf:maker, foaf:member, swrc:affiliation and foaf:based_near.]

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse-)cardinality values, which apply generically to properties, and the (inverse-)cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company (e.g., http://google.com) should not gain as much concurrence as entities that share a value referential to a smaller company (e.g., http://deri.ie). Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence (footnote 28).

We start by assigning a coefficient to each pair (p,o) and each pair (p,s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p,s) with respect to a graph G is given as:

$$C_G(p,s) = \frac{1}{\mathrm{Card}_G(p,s) \cdot \mathrm{AAC}_G(p)}$$

and the concurrence-coefficient of a predicate–object pair (p,o) with respect to a graph G is analogously given as:

$$IC_G(p,o) = \frac{1}{\mathrm{ICard}_G(p,o) \cdot \mathrm{AAIC}_G(p)}$$

Again, please note that these coefficients fall into the interval ]0,1], since the denominator is, by definition, necessarily greater than or equal to one.

To take an example, let $p_{wh}$ = foaf:workplaceHomepage, and say that we compute $\mathrm{AAIC}_G(p_{wh}) = 7$ from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let $o_g$ = http://google.com and assume that $o_g$ occurs 2,000 times as a value for $p_{wh}$: $\mathrm{ICard}_G(p_{wh}, o_g) = 2000$; now $IC_G(p_{wh}, o_g) = \frac{1}{2000 \cdot 7} \approx 0.00007$. Also, let $o_d$ = http://deri.ie such that $\mathrm{ICard}_G(p_{wh}, o_d) = 100$; now $IC_G(p_{wh}, o_d) = \frac{1}{100 \cdot 7} \approx 0.00143$. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.

Footnote 28: Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what's signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts: for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.
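The two coefficients just computed follow mechanically from Definition 5; as a trivial sketch (the values 7, 2000 and 100 are the hypothetical ones used above):

```python
def ic_coefficient(icard_po, aaic_p):
    """Concurrence coefficient IC_G(p, o) = 1 / (ICard_G(p, o) * AAIC_G(p)) (Definition 5)."""
    return 1.0 / (icard_po * aaic_p)

print(ic_coefficient(2000, 7))   # ~0.00007  (sharing http://google.com as workplace homepage)
print(ic_coefficient(100, 7))    # ~0.00143  (sharing http://deri.ie)
```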

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let $Z = (z_1, \ldots, z_n)$ be a tuple such that for each $i = 1, \ldots, n$, $z_i \in\ ]0,1]$. The aggregated concurrence value $\mathrm{ACS}_n$ is computed iteratively: starting with $\mathrm{ACS}_0 = 0$, then for each $k = 1, \ldots, n$, $\mathrm{ACS}_k = z_k + \mathrm{ACS}_{k-1} - z_k \cdot \mathrm{ACS}_{k-1}$.

The computation of the ACS value is the same process as determining the probability of two independent events occurring, $P(A \lor B) = P(A) + P(B) - P(A)P(B)$, which is by definition commutative and associative, and thus computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
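Because the aggregation is simply an iterated probabilistic sum, it can be written as a short fold; a minimal sketch (illustrative only):

```python
def acs(coefficients):
    """Aggregated concurrence score (Definition 6): iteratively combine
    coefficients z in ]0,1] as ACS_k = z_k + ACS_{k-1} - z_k * ACS_{k-1},
    i.e. the probabilistic sum P(A or B) = P(A) + P(B) - P(A)P(B)."""
    score = 0.0
    for z in coefficients:
        score = z + score - z * score
    return score

# order-independent: both calls give the same result
print(acs([0.5, 0.2, 0.1]))   # 0.64
print(acs([0.1, 0.2, 0.5]))   # 0.64
```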

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct but related property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges, i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation: we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples $\mathcal{Z} = \{Z_{p_1}, \ldots, Z_{p_n}\}$, where each element $Z_{p_i}$ represents the tuple of (non-pruned) coefficients generated for the predicate $p_i$ (footnote 29). We then aggregate this bag as follows:

$$\mathrm{ACS}(\mathcal{Z}) = \mathrm{ACS}\big(\big(\mathrm{ACS}(Z_{p_i} \cdot \mathrm{AA[I]C}(p_i))\big)_{Z_{p_i} \in \mathcal{Z}}\big)$$

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Footnote 29: For brevity, we omit the graph subscript.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s,p,o,c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s,p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o,p,s,c) order, counting the number of distinct subjects for each (p,o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (1).
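A minimal sketch of this statistics pass, assuming the quadruples are already sorted in (s, p, o, c) order (illustrative only; the credibility adjustment of Eq. (1) and the inverse direction over (o, p, s, c)-sorted data are analogous):

```python
from itertools import groupby
from collections import defaultdict

def predicate_cardinality_stats(sorted_quads):
    """Scan quads sorted by (s, p, o, c): for each (s, p) group count the
    distinct objects, and accumulate the average cardinality per predicate."""
    totals = defaultdict(lambda: [0, 0])   # predicate -> [sum of cardinalities, number of (s, p) groups]
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        n_objects = len({o for (_, _, o, _) in group})
        totals[p][0] += n_objects
        totals[p][1] += 1
    return {p: card_sum / groups for p, (card_sum, groups) in totals.items()}
```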

We then apply a second scan of the sorted corpus. First, we scan the data sorted in natural order, and for each (s,p) pair, for each set of unique objects $O_{ps}$ found thereon, and for each pair in

$$\{(o_i, o_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid o_i, o_j \in O_{ps},\ o_i < o_j\}$$

where $<$ denotes lexicographical order, we output the following sextuple to an on-disk file:

$$(o_i, o_j, C(p,s), p, s, -)$$

where $C(p,s) = \frac{1}{|O_{ps}| \cdot \mathrm{AAC}(p)}$. We apply the same process for the other direction: for each (p,o) pair, for each set of unique subjects $S_{po}$, and for each pair in

$$\{(s_i, s_j) \in (\mathbf{U} \cup \mathbf{B}) \times (\mathbf{U} \cup \mathbf{B}) \mid s_i, s_j \in S_{po},\ s_i < s_j\}$$

we output analogous sextuples of the form:

$$(s_i, s_j, IC(p,o), p, o, +)$$

We call the sets $O_{ps}$ and their analogues $S_{po}$ concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end, so as to forego duplication of intermediary processing.

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements: generalising the directional $o_i$/$s_i$ to simply $e_i$, each $(e_i, e_j)$ pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each $(e_i, e_j)$ pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form $(e_i, e_j, s)$, where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples $(e_j, e_i, s)$, which can be sorted separately as required.
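The following sketch illustrates the shape of this pipeline in memory (illustrative only: it covers a single direction, uses a toy representation where literals start with a quote character, and, for brevity, aggregates all coefficients with the plain probabilistic sum, omitting the same-value pruning and the per-predicate lower-level aggregation of Section 6.1.1):

```python
from itertools import combinations, groupby
from collections import defaultdict

def concurrence_tuples(sorted_quads, aac, max_class_size):
    """For each (s, p) group in (s, p, o, c)-sorted data, emit one tuple per pair
    of distinct URI/blank-node objects sharing that predicate-subject pair:
    (o_i, o_j, C(p, s), p, s, '-') with C(p, s) = 1 / (|O_ps| * AAC(p)).
    aac: map from predicate to its adjusted average cardinality."""
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        objects = sorted({o for (_, _, o, _) in group if not o.startswith('"')})  # skip literals
        if 2 <= len(objects) <= max_class_size:
            coeff = 1.0 / (len(objects) * aac[p])
            for o_i, o_j in combinations(objects, 2):
                yield (o_i, o_j, coeff, p, s, "-")

def aggregate(tuples):
    """Group tuples by entity pair and combine their coefficients by probabilistic sum."""
    scores = defaultdict(float)
    for e_i, e_j, coeff, _p, _v, _direction in tuples:
        current = scores[(e_i, e_j)]
        scores[(e_i, e_j)] = coeff + current - coeff * current
    return scores
```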

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets: for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate $\frac{10^6 \times (10^6 - 1)}{2} \approx 500$ billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., $|S_{po}|$) is a factor in the denominator of the computed $C(p,o)$ score, such that $C(p,o) \leq \frac{1}{|S_{po}|}$. Thus, in practice, we implement a maximum-size threshold for the $S_{po}$ and $O_{ps}$ concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.
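The class-size cut-off itself can be read off the distribution gathered during the statistics scan; a small sketch of that choice, assuming a map from class size to the number of classes of that size and the roughly 1 billion tuple budget mentioned above:

```python
def max_class_size(class_size_counts, budget=10**9):
    """Pick the largest class-size threshold such that the number of raw
    non-reflexive, non-symmetric concurrence tuples stays within the budget.
    class_size_counts: dict mapping class size -> number of such classes."""
    total, threshold = 0, 1
    for size in sorted(class_size_counts):
        pairs_per_class = size * (size - 1) // 2
        added = class_size_counts[size] * pairs_per_class
        if total + added > budget:
            break
        total += added
        threshold = size
    return threshold, total
```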

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash-function; the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object; triples with rdf:type as predicate are not included in the object-hashing; subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Table 12: Breakdown of timing of distributed concurrence analysis. Columns 1, 2, 4 and 8 refer to the number of slave machines over 100 million quadruples; Full-8 refers to the full corpus over eight slave machines. Each cell gives minutes and the percentage of total execution time.

Category | 1 | 2 | 4 | 8 | Full-8
Master: total execution time | 516.2 (100.0%) | 262.7 (100.0%) | 137.8 (100.0%) | 73.0 (100.0%) | 835.4 (100.0%)
Master: executing | 2.2 (0.4%) | 2.3 (0.9%) | 3.1 (2.3%) | 3.4 (4.6%) | 2.1 (0.3%)
  Miscellaneous | 2.2 (0.4%) | 2.3 (0.9%) | 3.1 (2.3%) | 3.4 (4.6%) | 2.1 (0.3%)
Master: idle (waiting for slaves) | 514.0 (99.6%) | 260.4 (99.1%) | 134.6 (97.7%) | 69.6 (95.4%) | 833.3 (99.7%)
Slave: avg. executing (total exc. idle) | 514.0 (99.6%) | 257.8 (98.1%) | 131.1 (95.1%) | 66.4 (91.0%) | 786.6 (94.2%)
  Split/sort/scatter (subject) | 119.6 (23.2%) | 59.3 (22.6%) | 29.9 (21.7%) | 14.9 (20.4%) | 175.9 (21.1%)
  Merge-sort (subject) | 42.1 (8.2%) | 20.7 (7.9%) | 11.2 (8.1%) | 5.7 (7.9%) | 62.4 (7.5%)
  Split/sort/scatter (object) | 115.4 (22.4%) | 56.5 (21.5%) | 28.8 (20.9%) | 14.3 (19.6%) | 164.0 (19.6%)
  Merge-sort (object) | 37.2 (7.2%) | 18.7 (7.1%) | 9.6 (7.0%) | 5.1 (7.0%) | 55.6 (6.6%)
  Extract high-level statistics | 20.8 (4.0%) | 10.1 (3.8%) | 5.1 (3.7%) | 2.7 (3.7%) | 28.4 (3.3%)
  Generate raw concurrence tuples | 45.4 (8.8%) | 23.3 (8.9%) | 11.6 (8.4%) | 5.9 (8.0%) | 67.7 (8.1%)
  Coordinate/sort concurrence tuples | 97.6 (18.9%) | 50.1 (19.1%) | 25.0 (18.1%) | 12.5 (17.2%) | 166.0 (19.9%)
  Merge-sort/aggregate similarity | 36.0 (7.0%) | 19.2 (7.3%) | 10.0 (7.2%) | 5.3 (7.2%) | 66.6 (8.0%)
Slave: avg. idle | 2.2 (0.4%) | 4.9 (1.9%) | 6.7 (4.9%) | 6.5 (9.0%) | 48.8 (5.8%)
  Waiting for peers | 0.0 (0.0%) | 2.6 (1.0%) | 3.6 (2.6%) | 3.2 (4.3%) | 46.7 (5.6%)
  Waiting for master | 2.2 (0.4%) | 2.3 (0.9%) | 3.1 (2.3%) | 3.4 (4.6%) | 2.1 (0.3%)

Here we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step: in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to an uneven distribution of data and subsequently uneven load balancing; e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.
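A sketch of the modulo-hash split used in steps (i) and (ii) above, with the rdf:type exception applied only when hashing on the object position (illustrative; a real deployment would use a deterministic hash rather than Python's built-in, seed-dependent hash):

```python
def split_by_hash(quads, position, n_slaves):
    """Assign each (s, p, o, c) quad to a slave segment by a modulo-hash on the
    given position (0 = subject, 2 = object); rdf:type triples are excluded
    from object-hashing to avoid skewed partitions (e.g. foaf:Person).
    The toy representation here uses the prefixed name 'rdf:type'."""
    RDF_TYPE = "rdf:type"
    segments = [[] for _ in range(n_slaves)]
    for quad in quads:
        if position == 2 and quad[1] == RDF_TYPE:
            continue
        segments[hash(quad[position]) % n_slaves].append(quad)
    return segments
```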

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% of its time spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static: i.e., larger class sizes give quadratically more concurrences, but occur polynomially less often, until the point where the largest classes, that generally only have one occurrence, is reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

[Fig. 9: Breakdown of potential concurrence classes given with respect to (a) predicate–object pairs and (b) predicate–subject pairs, respectively, where for each class size on the x-axis ($s_i$) we show the number of classes ($c_i$), the raw non-reflexive, non-symmetric concurrence tuples generated ($sp_i = c_i \cdot \frac{s_i^2 - s_i}{2}$), a cumulative count of tuples generated for increasing class sizes ($csp_i = \sum_{j \leq i} sp_j$), and our cut-off ($s_i = 38$) that we have chosen to keep the total number of tuples at 1 billion (log/log scale).]

Table 13: Top five predicates with respect to lowest adjusted average cardinality (AAC).

Predicate | Objects | Triples | AAC
1. foaf:nick | 150,433,035 | 150,437,864 | 1.000
2. lldpubmed:journal | 6,790,426 | 6,795,285 | 1.003
3. rdf:value | 2,802,998 | 2,803,021 | 1.005
4. eurostat:geo | 2,642,034 | 2,642,034 | 1.005
5. eurostat:time | 2,641,532 | 2,641,532 | 1.005

Table 14: Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

Predicate | Subjects | Triples | AAIC
1. lldpubmed:meshHeading | 2,121,384 | 2,121,384 | 1.009
2. opiumfield:recommendation | 1,920,992 | 1,920,992 | 1.010
3. fb:type.object.key | 1,108,187 | 1,108,187 | 1.017
4. foaf:page | 1,702,704 | 1,712,970 | 1.017
5. skipinions:hasFeature | 1,010,604 | 1,010,604 | 1.019

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000) we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996); the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows (footnote 30). Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property (footnote 31). We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

Footnote 30: cf. http://api.hi5.com/rest/profile/foaf/100614697
Footnote 31: cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on the raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p,o) pairs, and 303 million through (p,s) pairs).

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs: although fewer intra-PLD concurrences are found, they typically have higher concurrences (footnote 32).

Table 15: Top five concurrent entities and the number of pairs they share.

Entity label 1 | Entity label 2 | Concur.
1. New York City | New York State | 791
2. London | England | 894
3. Tokyo | Japan | 900
4. Toronto | Ontario | 418
5. Philadelphia | Pennsylvania | 217

Table 16: Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, giving, respectively: entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

Ranked entity | Con. | "Closest" entity | Val.
1. Tim Berners-Lee | 908 | Lalana Kagal | 0.83
2. Dan Brickley | 2,552 | Libby Miller | 0.94
3. updatestatus.net | 11 | socialnetwork.ro | 0.45
4. FOAF-a-matic | 21 | foaf.me | 0.23
5. Evan Prodromou | 3,367 | Stav Prodromou | 0.89

Footnote 32: Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top five ranked entities; for brevity, again we show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc. (footnote 33); (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

Footnote 33: Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

Footnote 34: We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.
Footnote 35: We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.
Footnote 36: http://models.okkam.org/ENS-core-vocabulary#country_of_residence
Footnote 37: http://ontologydesignpatterns.org/cp/owl/fsdas/

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process.

As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency (footnote 34). In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation (footnote 35).
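A small sketch of this bookkeeping (hypothetical helper; the None values model the "syntactic shortcut" of footnote 35 for identifiers that were not rewritten):

```python
def to_sextuple(s, p, o, c, canon):
    """Rewrite a quad to canonical identifiers while keeping the originals:
    returns (s*, p, o*, c, s_orig, o_orig), with None standing in for
    'unchanged' (the syntactic shortcut for s = s' and/or o = o')."""
    s_canon = canon.get(s, s)
    o_canon = canon.get(o, o)
    return (s_canon, p, o_canon, c,
            s if s_canon != s else None,
            o if o_canon != o else None)
```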

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms (footnote 36), and one directory gives 9 owl:AllDisjointClasses axioms (footnote 37), and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .
?x ?p ?l1 , ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where the first pattern is the terminological pattern (underlined in the original rule tables). To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for ?l1 and ?l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
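A sketch of the relaxed comparison for such datatype functional properties, over a simple (lexical form, language tag, datatype) representation of literals; data-value equality is crudely approximated here by tuple equality, and with this check none of the ex:Ted value pairs above would be flagged:

```python
def incompatible_literals(l1, l2):
    """Flag an inconsistency for a functional (datatype) property iff the two
    literals differ as data values AND their lower-cased lexical forms
    (ignoring language tags and datatypes) also differ."""
    (lex1, lang1, dt1), (lex2, lang2, dt2) = l1, l2
    if (lex1, lang1, dt1) == (lex2, lang2, dt2):   # equal data values: consistent
        return False
    return lex1.lower() != lex2.lower()            # lexically equal after relaxation: consistent

print(incompatible_literals(("male", None, None), ("Male", "en", None)))               # False
print(incompatible_literals(("25-05", None, "xsd:gMonthDay"), ("25-05", None, None)))   # False
print(incompatible_literals(("3.546", None, "xsd:double"), ("2.97", None, "xsd:double")))  # True
```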

With respect to these consistency checking rules, we consider the terminological data to be sound (footnote 38). We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Footnote 38: In any case, we always source terminological data from the raw, unconsolidated corpus.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández" (footnote 39) and "Sergio Fernández Anzuola" (footnote 40), who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export (footnote 41). The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

Footnote 39: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
Footnote 40: http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
Footnote 41: http://data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns; a subset of these may require "positive" reasoning to be detected, e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw (footnote 42).

Footnote 42: Note that this could also be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
  made owl:propertyDisjointWith maker .

Assertional [fictional]:
  ex:AZ maker ex:entcons .
  dblp:Antoine_Zimmermann made dblp:HoganZUPD15 .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → ℕ × ℝ is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq⁰, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej (intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph). We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqⁿ such that ∄{a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eqⁿ⁺¹ at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all of the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.
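A minimal sketch of this merge-join is given below; the tab-separated sextuple layout, with the canonical join key in the first field of each (re-ordered) tuple, and the file names are assumptions rather than the authors' format.

    from itertools import groupby

    def stream(path):
        # assumed layout: one tab-separated sextuple per line, already sorted,
        # with the canonical join key (subject or object) in the first field
        with open(path) as f:
            for line in f:
                yield line.rstrip("\n").split("\t")

    def consolidated_batches(by_subject_path, by_object_path):
        subj = groupby(stream(by_subject_path), key=lambda t: t[0])
        obj = groupby(stream(by_object_path), key=lambda t: t[0])
        ks, outlinks = next(subj, (None, iter(())))
        ko, inlinks = next(obj, (None, iter(())))
        while ks is not None or ko is not None:
            k = min(x for x in (ks, ko) if x is not None)      # next canonical term
            batch = []
            if ks == k:
                batch += [("out", t) for t in outlinks]
                ks, outlinks = next(subj, (None, iter(())))
            if ko == k:
                batch += [("in", t) for t in inlinks]
                ko, inlinks = next(obj, (None, iter(())))
            yield k, batch    # every statement in which k is the canonical subject/object

Each yielded batch collects the consolidated segment for one canonical identifier, which can then be buffered and processed independently as described next.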

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently these segments fit in memory, where for the largest equivalence classes we note that inlinks/outlinks are commonly duplicated (if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time). We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.
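The trivial check can be sketched as follows (illustrative only); since the incompatible pairs are unordered, connectivity of the undirected pair graph is tested here, which is how we read the "strongly connected" condition above.

    def must_disband(eq_class, incompatible_pairs):
        covered, adj = set(), {i: set() for i in eq_class}
        for a, b in incompatible_pairs:
            covered |= {a, b}
            adj[a].add(b)
            adj[b].add(a)
        if covered != set(eq_class):
            return False                      # some identifier is in no incompatible pair
        seen, stack = set(), [next(iter(eq_class))]
        while stack:                          # connectivity of the (undirected) pair graph
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n] - seen)
        return seen == set(eq_class)

    # hypothetical identifiers, purely for illustration
    pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:a", "ex:c")]
    print(must_disband({"ex:a", "ex:b", "ex:c"}, pairs))   # True: class is fully disbanded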

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

Category                                              1                  2                  4                  8                  Full-8
                                                      min      %         min      %         min      %         min      %         min      %
Master   Total execution time                         114.7    100.0     61.3     100.0     36.7     100.0     23.8     100.0     141.1    100.0
         Executing (miscellaneous)                    –        –         –        –         0.1      0.1       0.1      0.1       0.0      0.0
         Idle (waiting for slaves)                    114.7    100.0     61.2     100.0     36.6     99.9      23.8     99.9      141.1    100.0
Slave    Avg. executing (total exc. idle)             114.7    100.0     59.0     96.3      33.6     91.5      22.3     93.7      130.9    92.8
           Identify inconsistencies and repairs       74.7     65.1      38.5     62.8      23.2     63.4      17.2     72.2      72.4     51.3
           Repair corpus                              40.0     34.8      20.5     33.5      10.3     28.1      5.1      21.5      58.5     41.5
         Avg. idle                                    0.0      0.0       2.3      3.7       3.1      8.5       1.5      6.3       10.2     7.2
           Waiting for peers                          0.0      0.0       2.2      3.7       3.1      8.4       1.5      6.2       10.1     7.2
           Waiting for master                         –        –         –        –         –        –         –        –         0.1      0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
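As a quick check, these factors follow directly from the total execution times reported in Table 17 (illustrative arithmetic only):

    totals = [114.7, 61.3, 36.7, 23.8]                 # total minutes for 1, 2, 4, 8 slaves
    factors = [round(b / a, 3) for a, b in zip(totals, totals[1:])]
    print(factors)                                     # [0.534, 0.599, 0.649]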

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual        Auto          Count    %
SAME          /             223      44.3
DIFFERENT     /             261      51.9
UNCLEAR       –             19       3.8
/             SAME          361      74.6
/             DIFFERENT     123      25.4
SAME          SAME          205      91.9
SAME          DIFFERENT     18       8.1
DIFFERENT     DIFFERENT     105      40.2
DIFFERENT     SAME          156      59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

    Functional property                    Detections
1   foaf:gender                            60
2   foaf:age                               32
3   dbo:height, dbo:length                 4
4   dbo:wheelbase, dbo:width               3
5   atomowl:body                           1
6   loc:address, loc:name, loc:phone       1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

    Disjoint class 1        Disjoint class 2      Detections
1   foaf:Document           foaf:Person           163
2   foaf:Document           foaf:Agent            153
3   foaf:Organization       foaf:Person           17
4   foaf:Person             foaf:Project          3
5   foaf:Organization       foaf:Project          1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log). [Plot omitted; axes: number of partitions (x), number of repaired equivalence classes (y).]

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log). [Plot omitted; axes: number of IDs in partition (x), number of partitions (y).]

44 We were particularly careful to distinguish information resources (such as Wikipedia articles) and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional-properties given non-distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers, and for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair, and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
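These figures can be re-derived from the counts in Table 20; the following worked computation is illustrative (the last digit of the second F1 differs slightly due to rounding):

    kept_same, sep_same = 205, 18          # pairs manually judged SAME: kept / separated
    kept_diff, sep_diff = 156, 105         # pairs manually judged DIFFERENT: kept / separated

    # "same" = resources the repair keeps aligned
    p_same = kept_same / (kept_same + kept_diff)          # 205/361 = 0.568
    r_same = kept_same / (kept_same + sep_same)           # 205/223 = 0.919
    f_same = 2 * p_same * r_same / (p_same + r_same)      # 0.702

    # "different" = resources the repair keeps separate
    p_diff = sep_diff / (sep_diff + sep_same)             # 105/123 = 0.854
    r_diff = sep_diff / (sep_diff + kept_diff)            # 105/261 = 0.402
    f_diff = 2 * p_diff * r_diff / (p_diff + r_diff)      # = 0.547; reported as 0.546

    print(round(p_same, 3), round(r_same, 3), round(f_same, 3))
    print(round(p_diff, 3), round(r_diff, 3), round(f_diff, 3))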

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This lead to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person/Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external-sorts, characterised by the linearithmic time complexity O(n·log(n)); external-sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines is small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case (a brief sketch of this placement policy follows the list);

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.
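The placement policy mentioned in the third item above can be sketched as follows; this is a minimal illustration with assumed identifiers and a CRC32 hash, not the authors' implementation.

    import zlib

    RDF_TYPE = "rdf:type"

    def machines_for(triple, n_machines):
        """Return the slave machine(s) a triple is sent to: the hash of the subject,
        plus the hash of the object unless the predicate is rdf:type (so that very
        large classes do not all land on one machine)."""
        s, p, o = triple
        h = lambda term: zlib.crc32(term.encode("utf-8")) % n_machines
        targets = {h(s)}
        if p != RDF_TYPE:
            targets.add(h(o))
        return targets

    # hypothetical identifiers, purely for illustration
    print(machines_for(("ex:Galway", RDF_TYPE, "dbo:Town"), 8))
    print(machines_for(("ex:Galway", "foaf:homepage", "http://www.galway.ie/"), 8))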

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets) we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to

web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting co-referent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.
48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparisons, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine co-referent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby co-referent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mash-ups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 1.2 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset, or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be co-referent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space, where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching, and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks, and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper (particularly as applied to large-scale Linked Data corpora) is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix         URI

"T-Box prefixes"
atomowl        http://bblfish.net/work/atom-owl/2006-06-06/#
b2r            http://bio2rdf.org/bio2rdf#
b2rr           http://bio2rdf.org/bio2rdf_resource:
dbo            http://dbpedia.org/ontology/
dbp            http://dbpedia.org/property/
ecs            http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat       http://ontologycentral.com/2009/01/eurostat/ns#
fb             http://rdf.freebase.com/ns/
foaf           http://xmlns.com/foaf/0.1/
geonames       http://www.geonames.org/ontology#
kwa            http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed      http://linkedlifedata.com/resource/pubmed/
loc            http://sw.deri.org/2006/07/location/loc#
mo             http://purl.org/ontology/mo/
mvcb           http://webns.net/mvcb#
opiumfield     http://rdf.opiumfield.com/lastfm/spec#
owl            http://www.w3.org/2002/07/owl#
quaffing       http://purl.org/net/schemas/quaffing/
rdf            http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs           http://www.w3.org/2000/01/rdf-schema#
skipinions     http://skipforward.net/skipforward/page/seeder/skipinions/
skos           http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl        http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia        http://dbpedia.org/resource/
dblpperson     http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p      http://www.eswc2006.org/people/
identica       http://identi.ca/user/
kingdoms       http://lod.geospecies.org/kingdoms/
macs           http://stitch.cs.vu.nl/alignments/macs/
semweborg      http://www.data.semanticweb.org/organization/
swid           http://semanticweb.org/id/
timblfoaf      http://www.w3.org/People/Berners-Lee/card#
vperson        http://virtuoso.openlinksw.com/dataspace/person/
wikier         http://www.wikier.org/foaf.rdf#
yagor          http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design Issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).


[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the Same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan Andreas Harth Axel Polleres Scalable Authoritative OWLReasoning for the Web Int J Semantic Web Inf Syst 5 (2) (2009) 49ndash90

[32] Aidan Hogan Andreas Harth Juumlrgen Umbrich Sheila Kinsella Axel PolleresStefan Decker Searching and browsing linked data with SWSE the SemanticWeb search engine J Web Sem 9 (4) (2011) 365ndash401

[33] Aidan Hogan Jeff Z Pan Axel Polleres Stefan Decker SAOR template ruleoptimisations for distributed reasoning over 1 billion linked data triples inInternational Semantic Web Conference 2010

[34] Aidan Hogan Axel Polleres Juumlrgen Umbrich Antoine Zimmermann Someentities are more equal than others statistical methods to consolidate LinkedData in Fourth International Workshop on New Forms of Reasoning for theSemantic Web Scalable and Dynamic (NeFoRS2010) 2010

[35] Wei Hu Jianfeng Chen Yuzhong Qu A self-training approach for resolvingobject coreference on the Semantic Web WWW (2011) 87ndash96

[36] Wei Hu Yuzhong Qu Xingzhi Sun Bootstrapping object coreferencing on theSemantic Web J Comput Sci Technol 26 (4) (2011) 663ndash675

[37] Ekaterini Ioannou Odysseas Papapetrou Dimitrios Skoutas Wolfgang NejdlEfficient semantic-aware detection of near duplicate resources in ESWC (2)2010 pp 136ndash150

[38] Anja Jentzsch Jun Zhao Oktie Hassanzadeh Kei-Hoi Cheung MatthiasSamwald Bo Andersson Linking Open Drug Data in InternationalConference on Semantic Systems (I-SEMANTICSrsquo09) 2009

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


[Fig. 8 appears here: it depicts two researcher entities linked to shared papers via dc:creator, swrc:author, foaf:maker and foaf:made, to affiliated organisations via swrc:affiliation and foaf:member, and to dbpedia:Ireland via foaf:based_near.]

Fig. 8. Example of same-value, inter-property and intra-property correlation, where the two entities under comparison are highlighted in the dashed box, and where the labels of inward edges (with respect to the principal entities) are italicised and underlined (from [34]).


(1) in the given unreduced form, as it more clearly corresponds to the structure of a standard credibility formula: the reading (AIC_G(p)) is dampened towards a mean (AIC_G) by a factor determined by the size of the sample used to derive the reading (|G_p|) relative to the average sample size (|G|).

Now we move towards combining these metrics to determine the concurrence of entities who share a given non-empty set of property–value pairs. To do so, we combine the adjusted average (inverse) cardinality values, which apply generically to properties, and the (inverse) cardinality values, which apply to a given property–value pair. For example, take the property foaf:workplaceHomepage: entities that share a value referential to a large company—e.g., http://google.com—should not gain as much concurrence as entities that share a value referential to a smaller company—e.g., http://deri.ie. Conversely, consider a fictional property ex:citizenOf, which relates a citizen to its country, and for which we find many observations in our corpus, returning a high AAIC value; and consider that only two entities share the value ex:Vanuatu for this property: given that our data are incomplete, we can use the high AAIC value of ex:citizenOf to determine that the property is usually not exclusive, and that it is generally not a good indicator of concurrence.²⁸

28. Here we try to distinguish between property–value pairs that are exclusive in reality (i.e., on the level of what is signified) and those that are exclusive in the given graph. Admittedly, one could think of counter-examples where not including the general statistics of the property may yield a better indication of weighted concurrence, particularly for generic properties that can be applied in many contexts; for example, consider the exclusive predicate–object pair (skos:subject, category:Koenigsegg_vehicles) given for a non-exclusive property.

We start by assigning a coefficient to each pair (p, o) and each pair (p, s) that occur in the dataset, where the coefficient is an indicator of how exclusive that pair is.

Definition 5 (Concurrence coefficients). The concurrence-coefficient of a predicate–subject pair (p, s) with respect to a graph G is given as:

C_G(p, s) = 1 / (Card_G(p, s) × AAC_G(p))

and the concurrence-coefficient of a predicate–object pair (p, o) with respect to a graph G is analogously given as:

IC_G(p, o) = 1 / (ICard_G(p, o) × AAIC_G(p))

Again, please note that these coefficients fall into the interval ]0, 1], since the denominator, by definition, is necessarily greater than one.

To take an example, let p_wh = foaf:workplaceHomepage, and say that we compute AAIC_G(p_wh) = 7 from a large number of observations, indicating that each workplace homepage in the graph G is linked to by, on average, seven employees. Further, let o_g = http://google.com and assume that o_g occurs 2,000 times as a value for p_wh: ICard_G(p_wh, o_g) = 2,000; now IC_G(p_wh, o_g) = 1/(2,000 × 7) ≈ 0.00007. Also, let o_d = http://deri.ie such that ICard_G(p_wh, o_d) = 100; now IC_G(p_wh, o_d) = 1/(100 × 7) ≈ 0.00143. Here, sharing DERI as a workplace will indicate a higher level of concurrence than analogously sharing Google.
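To make the arithmetic concrete, the following is a minimal sketch (our own illustration, not code from the paper) of Definition 5's inverse concurrence coefficient applied to this example; the AAIC value and occurrence counts are the ones assumed above.

def inverse_concurrence_coefficient(icard, aaic):
    # IC(p, o) = 1 / (ICard(p, o) * AAIC(p)); always falls in ]0, 1]
    return 1.0 / (icard * aaic)

AAIC_PWH = 7.0                                               # assumed AAIC of foaf:workplaceHomepage
ic_google = inverse_concurrence_coefficient(2000, AAIC_PWH)  # ~0.00007
ic_deri = inverse_concurrence_coefficient(100, AAIC_PWH)     # ~0.00143
print(ic_google, ic_deri)  # sharing the rarer workplace yields the higher coefficient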

Finally, we require some means of aggregating the coefficients of the set of pairs that two entities share, to derive the final concurrence measure.

Definition 6 (Aggregated concurrence score). Let Z = (z_1, ..., z_n) be a tuple such that for each i = 1, ..., n, z_i ∈ ]0, 1]. The aggregated concurrence value ACS_n is computed iteratively: starting with ACS_0 = 0, then for each k = 1, ..., n, ACS_k = z_k + ACS_{k−1} − z_k × ACS_{k−1}.

The computation of the ACS value is the same process as determining the probability of one of two independent events occurring—P(A ∨ B) = P(A) + P(B) − P(A) × P(B)—which is by definition commutative and associative, and thus the computation is independent of the order of the elements in the tuple. It may be more accurate to view the coefficients as fuzzy values, and the aggregation function as a disjunctive combination in some extensions of fuzzy logic [75].
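The iterative computation of Definition 6 can be sketched in a few lines; this is our own illustration (not the authors' implementation) and simply realises the probabilistic sum just described.

def acs(coefficients):
    # aggregated concurrence score over a tuple of values in ]0, 1]
    score = 0.0
    for z in coefficients:
        score = z + score - z * score   # P(A or B) = P(A) + P(B) - P(A)P(B) for independent events
    return score

print(acs([0.5, 0.5]))       # 0.75
print(acs([0.5, 0.5, 0.5]))  # 0.875: each further shared pair contributes positively, but less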

However, the underlying coefficients may not be derived from strictly independent phenomena: there may indeed be correlation between the property–value pairs that two entities share. To illustrate, we reintroduce a relevant example from [34], shown in Fig. 8, where we see two researchers that have co-authored many papers together, have the same affiliation, and are based in the same country.

This example illustrates three categories of concurrence correlation:

(i) same-value correlation, where two entities may be linked to the same value by multiple predicates in either direction (e.g., foaf:made, dc:creator, swrc:author, foaf:maker);

(ii) intra-property correlation, where two entities that share a given property–value pair are likely to share further values for the same property (e.g., co-authors sharing one value for foaf:made are more likely to share further values);

(iii) inter-property correlation, where two entities sharing a given property–value pair are likely to share further distinct, but related, property–value pairs (e.g., having the same value for swrc:affiliation and foaf:based_near).

Ideally, we would like to reflect such correlation in the computation of the concurrence between the two entities.

Regarding same-value correlation: for a value with multiple edges shared between two entities, we choose the shared predicate edge with the lowest AA[I]C value and disregard the other edges; i.e., we only consider the most exclusive property used by both entities to link to the given value, and prune the other edges.

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate–value pairs: instead of aggregating a single tuple of coefficients, we generate a bag of tuples Z = {Z_{p_1}, ..., Z_{p_n}}, where each element Z_{p_i} represents the tuple of (non-pruned) coefficients generated for the predicate p_i.²⁹ We then aggregate this bag as follows:

ACS(Z) = ACS( ⟨ ACS( Z_{p_i} / AA[I]C(p_i) ) ⟩_{Z_{p_i} ∈ Z} )

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate–value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper-bound set by its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

29. For brevity, we omit the graph subscript.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory—if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c), and then scanning the data, computing the cardinality (number of distinct objects) for each (s, p) pair, and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p, o) pair, and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (4).
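As a rough illustration of this pass (our own sketch, assuming a simple quad-tuple representation; the credibility adjustment of Eq. (4) is applied afterwards and is omitted here), the per-predicate average cardinalities can be gathered in a single scan over the sorted data:

from itertools import groupby

def average_cardinalities(sorted_quads):
    # sorted_quads: iterable of (s, p, o, c) tuples in natural (s, p, o, c) order
    sums, counts = {}, {}
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        card = len({o for (_, _, o, _) in group})     # distinct objects for this (s, p) pair
        sums[p] = sums.get(p, 0) + card
        counts[p] = counts.get(p, 0) + 1
    return {p: sums[p] / counts[p] for p in sums}     # average cardinality per predicate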

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order, and for each (s, p) pair, for each set of unique objects O_{p,s} found thereon, and for each pair in

{ (o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_{p,s}, o_i < o_j }

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p, s), p, s, −)

where C(p, s) = 1 / (|O_{p,s}| × AAC(p)). We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects S_{p,o}, and for each pair in

{ (s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_{p,o}, s_i < s_j }

we output analogous sextuples of the form:

(s_i, s_j, IC(p, o), p, o, +)

We call the sets O_{p,s} and their analogues S_{p,o} concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores, where we instead do so at the end, so as to forego duplication of intermediary processing.
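A minimal sketch of this pass for the predicate–subject direction follows (our own illustration, again assuming (s, p, o, c) quads, an in-memory map aac of adjusted average cardinalities, and a hypothetical is_literal test):

from itertools import combinations, groupby

def emit_ps_sextuples(sorted_quads, aac, out):
    # one sextuple per unordered pair of distinct objects sharing an (s, p) pair
    for (s, p), group in groupby(sorted_quads, key=lambda q: (q[0], q[1])):
        objs = sorted({o for (_, _, o, _) in group if not is_literal(o)})
        if len(objs) < 2:
            continue
        coeff = 1.0 / (len(objs) * aac[p])            # C(p, s) = 1 / (|O_{p,s}| * AAC(p))
        for oi, oj in combinations(objs, 2):          # oi < oj in lexicographical order
            out.append((oi, oj, coeff, p, s, '-'))    # '-' marks the predicate-subject direction

def is_literal(term):
    # placeholder assumption: literals are serialised with surrounding quotes
    return term.startswith('"')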

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements—generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate–value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets—for example, consider a corpus containing 1 million persons sharing the value female for the property foaf:gender, where we would have to generate 10⁶ × (10⁶ − 1)/2 ≈ 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set—e.g., |S_{p,o}|—is a factor in the denominator of the computed C(p, o) score, such that C(p, o) ≤ 1/|S_{p,o}|. Thus, in practice, we implement a maximum-size threshold for the S_{p,o} and O_{p,s} concurrence classes: this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38—we will see more in Section 6.4.
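The threshold itself can be read off the class-size distribution gathered alongside the predicate statistics; the following sketch (our own illustration, with a hypothetical histogram mapping class sizes to class counts) picks the largest size whose cumulative pair count stays within a tuple budget:

def max_class_size(histogram, budget=10**9):
    # histogram: {class_size: number_of_classes}; returns the largest admissible class size
    total, threshold = 0, 0
    for size in sorted(histogram):
        pairs = histogram[size] * size * (size - 1) // 2   # non-reflexive, non-symmetric pairs
        if total + pairs > budget:
            break
        total += pairs
        threshold = size
    return threshold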

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash-function; the slaves simultaneously gather incoming sorted segments, and subsequently perform a merge-sort of the segments;

(ii) coordinate: the slave machines apply the same operation, this time hashing on object—triples with rdf:type as predicate are not included in the object-hashing; subsequently, the slaves merge-sort the segments ordered by object;

(iii) run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold;

(iv) gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples, as described before, from a scan of the data in both orders;

(v) coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before;

(vi) run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples;

(vii) run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.

Table 12. Breakdown of timing of distributed concurrence analysis, in minutes and as a percentage of total execution time, for 1, 2, 4 and 8 slave machines over the 100 million quadruple sample, and for the full corpus over 8 machines (Full-8).

Category | 1 | 2 | 4 | 8 | Full-8
Master
  Total execution time | 516.2 (100.0) | 262.7 (100.0) | 137.8 (100.0) | 73.0 (100.0) | 835.4 (100.0)
  Executing | 2.2 (0.4) | 2.3 (0.9) | 3.1 (2.3) | 3.4 (4.6) | 2.1 (0.3)
    Miscellaneous | 2.2 (0.4) | 2.3 (0.9) | 3.1 (2.3) | 3.4 (4.6) | 2.1 (0.3)
  Idle (waiting for slaves) | 514.0 (99.6) | 260.4 (99.1) | 134.6 (97.7) | 69.6 (95.4) | 833.3 (99.7)
Slave
  Avg. executing (total exc. idle) | 514.0 (99.6) | 257.8 (98.1) | 131.1 (95.1) | 66.4 (91.0) | 786.6 (94.2)
    Split/sort/scatter (subject) | 119.6 (23.2) | 59.3 (22.6) | 29.9 (21.7) | 14.9 (20.4) | 175.9 (21.1)
    Merge-sort (subject) | 42.1 (8.2) | 20.7 (7.9) | 11.2 (8.1) | 5.7 (7.9) | 62.4 (7.5)
    Split/sort/scatter (object) | 115.4 (22.4) | 56.5 (21.5) | 28.8 (20.9) | 14.3 (19.6) | 164.0 (19.6)
    Merge-sort (object) | 37.2 (7.2) | 18.7 (7.1) | 9.6 (7.0) | 5.1 (7.0) | 55.6 (6.6)
    Extract high-level statistics | 20.8 (4.0) | 10.1 (3.8) | 5.1 (3.7) | 2.7 (3.7) | 28.4 (3.3)
    Generate raw concurrence tuples | 45.4 (8.8) | 23.3 (8.9) | 11.6 (8.4) | 5.9 (8.0) | 67.7 (8.1)
    Coordinate/sort concurrence tuples | 97.6 (18.9) | 50.1 (19.1) | 25.0 (18.1) | 12.5 (17.2) | 166.0 (19.9)
    Merge-sort/aggregate similarity | 36.0 (7.0) | 19.2 (7.3) | 10.0 (7.2) | 5.3 (7.2) | 66.6 (8.0)
  Avg. idle | 2.2 (0.4) | 4.9 (1.9) | 6.7 (4.9) | 6.5 (9.0) | 48.8 (5.8)
    Waiting for peers | 0.0 (0.0) | 2.6 (1.0) | 3.6 (2.6) | 3.2 (4.3) | 46.7 (5.6)
    Waiting for master | 2.2 (0.4) | 2.3 (0.9) | 3.1 (2.3) | 3.4 (4.6) | 2.1 (0.3)

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step—in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.
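A bare-bones sketch of such a coordinate step (our own illustration; network transfer and on-disk batching are elided) hashes each quad on the join position so that all quads sharing that term land on the same peer:

def split_by_join_position(quads, num_slaves, position=0):
    # position 0 = subject, 2 = object; each segment is scattered to the corresponding
    # peer, which then merge-sorts the sorted segments it receives
    segments = [[] for _ in range(num_slaves)]
    for quad in quads:
        segments[hash(quad[position]) % num_slaves].append(quad)
    return [sorted(segment) for segment in segments]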

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples, and given the distribution of class memberships, we assume that hashing these values will lead to an uneven distribution of data, and subsequently uneven load balancing—e.g., 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970—even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 min is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (96%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% of its time spent waiting for the slave machines to finish their tasks—again, interleaving a job for another task would have practical benefits.

Second, with regards the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1 machine) to 9% (8 machines). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static—i.e., larger class sizes give

[Fig. 9 appears here: two log/log plots—(a) predicate–object pairs and (b) predicate–subject pairs—with concurrence class size on the x-axis and counts on the y-axis, plotting the number of classes, the concurrence output size, the cumulative concurrence output size, and the chosen cut-off.]

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs, respectively, where for each class size on the x-axis (s_i), we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i × (s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13. Top five predicates with respect to lowest adjusted average cardinality (AAC).

  | Predicate | Objects | Triples | AAC
1 | foaf:nick | 150,433,035 | 150,437,864 | 1.000
2 | lldpubmed:journal | 6,790,426 | 6,795,285 | 1.003
3 | rdf:value | 2,802,998 | 2,803,021 | 1.005
4 | eurostat:geo | 2,642,034 | 2,642,034 | 1.005
5 | eurostat:time | 2,641,532 | 2,641,532 | 1.005

Table 14. Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  | Predicate | Subjects | Triples | AAIC
1 | lldpubmed:meshHeading | 2,121,384 | 2,121,384 | 1.009
2 | opiumfield:recommendation | 1,920,992 | 1,920,992 | 1.010
3 | fb:type.object.key | 1,108,187 | 1,108,187 | 1.017
4 | foaf:page | 1,702,704 | 1,712,970 | 1.017
5 | skipinions:hasFeature | 1,010,604 | 1,010,604 | 1.019


quadratically more concurrences, but occur polynomially less often—until the point where the largest classes, which generally only have one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, …). Also, we observe some "noise", where for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000) we observe an unusual number of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996)—the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.³⁰ Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999)—almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.³¹ We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

30. cf. http://api.hi5.com/rest/profile/foaf/100614697
31. cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs, and 303 million through (p, s) pairs).

With respect to the statistics of predicates: for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from the same PLD (3.1%), whereas 617.4 million involved identifiers from different PLDs; however, the average concurrence value for an intra-PLD pair was 0.446, versus 0.002 for inter-PLD pairs—although fewer intra-PLD concurrences are found, they typically have higher concurrences.³²

32. Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on an arbitrary lexical ordering.

Table 15. Top five concurrent entities and the number of pairs they share.

  | Entity label 1 | Entity label 2 | Concur.
1 | New York City | New York State | 791
2 | London | England | 894
3 | Tokyo | Japan | 900
4 | Toronto | Ontario | 418
5 | Philadelphia | Pennsylvania | 217

Table 16. Breakdown of concurrences for the top five ranked entities in SWSE, ordered by rank, with, respectively, the entity label, the number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

  | Ranked entity | Con. | "Closest" entity | Val.
1 | Tim Berners-Lee | 908 | Lalana Kagal | 0.83
2 | Dan Brickley | 2,552 | Libby Miller | 0.94
3 | updatestatus.net | 11 | socialnetwork.ro | 0.45
4 | FOAF-a-matic | 21 | foaf.me | 0.23
5 | Evan Prodromou | 3,367 | Stav Prodromou | 0.89



In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share—the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters—properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top five ranked entities—for brevity, again we show entity labels. In particular, we note that a large number of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;³³ (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

33. Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.


7. Entity disambiguation and repair

We have already seen that—even by only exploiting the formal logical consequences of the data through reasoning—consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class—breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see whether the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.³⁴ In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; and (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

34. We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

First, in order to track which data are consolidated and which are not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.³⁵

35. We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,³⁶ and one directory gives 9 owl:AllDisjointClasses axioms,³⁷ and where we found no other OWL 2 axioms relevant to the rules in Table 4.

36. http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37. http://ontologydesignpatterns.org/cp/owl/fsdas/

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length—a functional property—and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values, and MM-DD syntax for birthdays, but without giving recommendations for datatypes or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
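The relaxed comparison just described can be sketched as follows (our own illustration; literal terms are assumed to arrive as (lexical form, language tag, datatype) triples, and check (i)—that the data values are unequal—is assumed to be established separately):

def lexically_incompatible(l1, l2):
    # flag only if the lower-cased lexical forms differ, ignoring language tags and datatypes
    return l1[0].lower() != l2[0].lower()

print(lexically_incompatible(("male", None, None), ("Male", "en", None)))                   # False: not flagged
print(lexically_incompatible(("3.546", None, "xsd:double"), ("2.97", None, "xsd:double")))  # True: inconsistency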

With respect to these consistency checking rules, we consider the terminological data to be sound.³⁸ We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type) and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38. In any case, we always source terminological data from the raw, unconsolidated corpus.

Further, note that owl:sameAs patterns—particularly in rule eq-diff1—are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"³⁹ and "Sergio Fernández Anzuola"⁴⁰, who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.⁴¹ The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40. http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41. http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.⁴²

42. Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks available through a simple merge-join, as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers—what we term incompatible identifiers—such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
foaf:made owl:propertyDisjointWith foaf:maker .

Assertional [fictional]:
ex:AZ″ foaf:maker ex:entcons′ .
dblp:Antoine_Zimmermann″ foaf:made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); and (ii) the alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject—loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency—i.e., an inconsistency derivable by the alignment of incompatible identifiers—we first derive a collection of sets C = {C_1, ..., C_n}, C ⊆ 2^{U∪B}, such that each C_i ∈ C, C_i ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq⁰, each containing an individual identifier in the equivalence class. Now let Ω(E_i, E_j) denote the aggregated weight of the edge considering the merge of the nodes of E_i and the nodes of E_j in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in E_i and all nodes in E_j, and such that c denotes the concurrence score considering the merge of the entities in E_i and E_j—intuitively, the same weight as before, but applied as if the identifiers in E_i and E_j were consolidated in the graph. We can apply the following clustering:

– for each pair of sets E_i, E_j ∈ Eqⁿ such that there exists no pair {a, b} ∈ C with a ∈ E_i and b ∈ E_j (i.e., consistently mergeable subsets), identify the weights of Ω(E_i, E_j) and order the pairings;

– in descending order with respect to the above weights, merge E_i, E_j pairs—such that neither E_i nor E_j has already been merged in this iteration—producing Eqⁿ⁺¹ at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.⁴³

43. We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].
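A compact sketch of this greedy clustering follows (our own illustration under simplifying assumptions: weights are reduced to the evidence counts d, with the concurrence tie-breaker c omitted; incompatible is the set of unordered pairs in C, and evidence maps unordered identifier pairs to their evidence counts):

def repair(identifiers, incompatible, evidence):
    # start from singleton classes and greedily merge consistently mergeable pairs
    classes = [{i} for i in identifiers]
    def weight(ci, cj):
        return sum(evidence.get(frozenset({a, b}), 0) for a in ci for b in cj)
    def clashes(ci, cj):
        return any(frozenset({a, b}) in incompatible for a in ci for b in cj)
    changed = True
    while changed:                        # iterate until fixpoint
        changed = False
        candidates = sorted(
            ((weight(ci, cj), i, j) for i, ci in enumerate(classes)
             for j, cj in enumerate(classes) if i < j and not clashes(ci, cj)),
            reverse=True)
        merged_idx, merged = set(), []
        for w, i, j in candidates:        # strongest-linked pairs first
            if w > 0 and i not in merged_idx and j not in merged_idx:
                merged.append(classes[i] | classes[j])
                merged_idx.update((i, j))
                changed = True
        merged.extend(c for k, c in enumerate(classes) if k not in merged_idx)
        classes = merged
    return classes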

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respective consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated—if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and to the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.
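For illustration, a minimal sketch of this trivial-solution test (our own, treating the incompatibility pairs as an undirected graph and testing connectivity over the whole class) could look as follows:

def must_disband(identifiers, incompatible_pairs):
    # trivial repair: every identifier is incompatible with some other, and the
    # incompatibility graph connects the entire equivalence class
    identifiers = set(identifiers)
    touched = {x for pair in incompatible_pairs for x in pair}
    if touched != identifiers:
        return False
    adjacency = {x: set() for x in identifiers}
    for a, b in incompatible_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = set(), [next(iter(identifiers))]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adjacency[node] - seen)
    return seen == identifiers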

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and reinfer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of the terminological data and the predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks: for the concurrence analysis, we in fact maintain sextuples during the data-preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

                                           1              2              4              8              Full-8
                                           min     %      min     %      min     %      min     %      min     %
Master
  Total execution time                     114.7   100.0  61.3    100.0  36.7    100.0  23.8    100.0  141.1   100.0
  Executing (miscellaneous)                0.0     0.0    0.1     0.1    0.1     0.1    0.1     0.1    0.0     0.0
  Idle (waiting for slaves)                114.7   100.0  61.2    100.0  36.6    99.9   23.8    99.9   141.1   100.0
Slave
  Avg. executing (total exc. idle)         114.7   100.0  59.0    96.3   33.6    91.5   22.3    93.7   130.9   92.8
    Identify inconsistencies and repairs   74.7    65.1   38.5    62.8   23.2    63.4   17.2    72.2   72.4    51.3
    Repair corpus                          40.0    34.8   20.5    33.5   10.3    28.1   5.1     21.5   58.5    41.5
  Avg. idle                                0.0     0.0    2.3     3.7    3.1     8.5    1.5     6.3    10.2    7.2
    Waiting for peers                      0.0     0.0    2.2     3.7    3.1     8.4    1.5     6.2    10.1    7.2
    Waiting for master                     0.0     0.0    0.1     0.1    0.0     0.1    0.0     0.1    0.1     0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
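These factors can be reproduced directly from the total execution times in Table 17; a quick check (minutes taken from the reconstructed table):

```python
# Total execution times (minutes) for 1, 2, 4 and 8 slave machines (Table 17).
totals = [114.7, 61.3, 36.7, 23.8]
factors = [round(b / a, 3) for a, b in zip(totals, totals[1:])]
print(factors)  # [0.534, 0.599, 0.649]
```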

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated


Table 20
Results of manual repair inspection.

Manual      Auto        Count   %
SAME        /           223     44.3
DIFFERENT   /           261     51.9
UNCLEAR     –           19      3.8
/           SAME        361     74.6
/           DIFFERENT   123     25.4
SAME        SAME        205     91.9
SAME        DIFFERENT   18      8.1
DIFFERENT   DIFFERENT   105     40.2
DIFFERENT   SAME        156     59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

#   Functional property                  Detections
1   foaf:gender                          60
2   foaf:age                             32
3   dbo:height, dbo:length               4
4   dbo:wheelbase, dbo:width             3
5   atomowl:body                         1
6   loc:address, loc:name, loc:phone     1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

#   Disjoint class 1      Disjoint class 2     Detections
1   foaf:Document         foaf:Person          163
2   foaf:Document         foaf:Agent           153
3   foaf:Organization     foaf:Person          17
4   foaf:Person           foaf:Project         3
5   foaf:Organization     foaf:Project         1

[Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions; y-axis: number of repaired equivalence classes.]

[Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition; y-axis: number of partitions.]

44. We were particularly careful to distinguish information resources (such as Wikipedia articles) and non-information resources as being DIFFERENT.


batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional-properties given non-distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 484 (clear) manually-inspected pairs, 361 (74.6%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
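These figures follow from the pairwise counts in Table 20; the following check makes the computation explicit (counts are those from the table; the function and variable names are ours):

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# 205 SAME pairs kept together, 156 DIFFERENT pairs kept together,
# 18 SAME pairs separated, 105 DIFFERENT pairs separated (Table 20).
print(precision_recall_f1(205, 156, 18))   # ~(0.568, 0.919, 0.702): re-aligned resources
print(precision_recall_f1(105, 18, 156))   # ~(0.854, 0.402, 0.546): kept-separate resources
```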

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log n); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks,

45. Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets), we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real

46. We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

litmus test when integrated into a system with a critical user-base. Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets)

47. Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48. OAEI: http://oaei.ontologymatching.org/
49. Instance data matching: http://www.instancematching.org/

across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the aligning step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50. In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely

51. From personal communication with the Sindice team.
52. http://sig.ma/search?q=paris


to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53. http://www.rkbexplorer.com/sameAs/
54. http://sameas.org/
55. http://ws.nju.edu.cn/objectcoref/
56. Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix          URI

"T-Box prefixes"
atomowl:        http://bblfish.net/work/atom-owl/2006-06-06/
b2r:            http://bio2rdf.org/bio2rdf:
b2rr:           http://bio2rdf.org/bio2rdf_resource:
dbo:            http://dbpedia.org/ontology/
dbp:            http://dbpedia.org/property/
ecs:            http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat:       http://ontologycentral.com/2009/01/eurostat/ns#
fb:             http://rdf.freebase.com/ns/
foaf:           http://xmlns.com/foaf/0.1/
geonames:       http://www.geonames.org/ontology#
kwa:            http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed:      http://linkedlifedata.com/resource/pubmed/
loc:            http://sw.deri.org/2006/07/location/loc#
mo:             http://purl.org/ontology/mo/
mvcb:           http://webns.net/mvcb/
opiumfield:     http://rdf.opiumfield.com/lastfm/spec#
owl:            http://www.w3.org/2002/07/owl#
quaffing:       http://purl.org/net/schemas/quaffing/
rdf:            http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:           http://www.w3.org/2000/01/rdf-schema#
skipinions:     http://skipforward.net/skipforward/page/seeder/skipinions/
skos:           http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl:        http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia:        http://dbpedia.org/resource/
dblpperson:     http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p:      http://www.eswc2006.org/people/
identicau:      http://identi.ca/user/
kingdoms:       http://lod.geospecies.org/kingdoms/
macs:           http://stitch.cs.vu.nl/alignments/macs/
semweborg:      http://data.semanticweb.org/organization/
swid:           http://semanticweb.org/id/
timblfoaf:      http://www.w3.org/People/Berners-Lee/card#
vperson:        http://virtuoso.openlinksw.com/dataspace/person/
wikier:         http://www.wikier.org/foaf.rdf#
yagor:          http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OKKAM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


edge with the lowest AA[I]C value and disregard the other edgesie we only consider the most exclusive property used by bothentities to link to the given value and prune the other edges

Regarding intra-property correlation, we apply a lower-level aggregation for each predicate in the set of shared predicate-value pairs. Instead of aggregating a single tuple of coefficients, we generate a bag of tuples Z = {Z_p1, ..., Z_pn}, where each element Z_pi represents the tuple of (non-pruned) coefficients generated for the predicate p_i (for brevity, we omit the graph subscript). We then aggregate this bag as follows:

ACS(Z) = ACS({ACS(Z_pi, AA[I]C(p_i)) : Z_pi ∈ Z})

where AA[I]C is either AAC or AAIC, dependent on the directionality of the predicate-value pair observed. Thus, the total contribution possible through a given predicate (e.g., foaf:made) has an upper bound set as its AA[I]C value, where each successive shared value for that predicate (e.g., each successive co-authored paper) contributes positively (but increasingly less) to the overall concurrence measure.

Detecting and counteracting inter-property correlation is perhaps more difficult, and we leave this as an open question.

6.1.2. Implementing entity-concurrence analysis

We aim to implement the above methods using sorts and scans, and we wish to avoid any form of complex indexing or pair-wise comparison. First, we wish to extract the statistics relating to the (inverse-)cardinalities of the predicates in the data. Given that there are 23 thousand unique predicates found in the input corpus, we assume that we can fit the list of predicates and their associated statistics in memory; if such were not the case, one could consider an on-disk map, where we would expect a high cache hit-rate based on the distribution of property occurrences in the data (cf. [32]).

Moving forward, we can calculate the necessary predicate-level statistics by first sorting the data according to natural order (s, p, o, c) and then scanning the data, computing the cardinality (number of distinct objects) for each (s, p) pair and maintaining the average cardinality for each p found. For inverse-cardinality scores, we apply the same process, sorting instead by (o, p, s, c) order, counting the number of distinct subjects for each (p, o) pair and maintaining the average inverse-cardinality scores for each p. After each scan, the statistics of the properties are adjusted according to the credibility formula in Eq. (4).
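To make the scan concrete, the following is a minimal sketch (ours, not the paper's code) of the first pass, assuming quadruples are available as (s, p, o, c) tuples already sorted in natural order; the credibility adjustment of Eq. (4) is omitted, and the inverse direction is obtained by running the same routine over (o, p, s, c)-sorted data, grouping on (o, p) and counting distinct subjects.

from itertools import groupby

def average_cardinalities(quads_sorted_spoc):
    """Scan quads sorted by (s, p, o, c): for each (s, p) group, count
    distinct objects, and keep a running mean of that count per predicate."""
    totals = {}  # predicate -> [sum of cardinalities, number of (s, p) pairs]
    for (s, p), group in groupby(quads_sorted_spoc, key=lambda q: (q[0], q[1])):
        cardinality = len({o for (_, _, o, _) in group})
        acc = totals.setdefault(p, [0, 0])
        acc[0] += cardinality
        acc[1] += 1
    return {p: sum_card / n for p, (sum_card, n) in totals.items()}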

We then apply a second scan of the sorted corpus: first, we scan the data sorted in natural order and, for each (s, p) pair, for each set of unique objects O_ps found thereon, and for each pair in

{(o_i, o_j) ∈ (U ∪ B) × (U ∪ B) | o_i, o_j ∈ O_ps, o_i < o_j}

where < denotes lexicographical order, we output the following sextuple to an on-disk file:

(o_i, o_j, C(p, s), p, s, −)

where C(p, s) = 1/(|O_ps| × AAC(p)) (with IC(p, o) = 1/(|S_po| × AAIC(p)) defined analogously below). We apply the same process for the other direction: for each (p, o) pair, for each set of unique subjects S_po, and for each pair in

{(s_i, s_j) ∈ (U ∪ B) × (U ∪ B) | s_i, s_j ∈ S_po, s_i < s_j}

we output analogous sextuples of the form

(s_i, s_j, IC(p, o), p, o, +)

We call the sets O_ps, and their analogues S_po, concurrence classes, denoting sets of entities that share the given predicate–subject or predicate–object pair, respectively. Here, note that the '+' and '−' elements simply mark and track the directionality from which the tuple was generated, as required for the final aggregation of the coefficient scores. Similarly, we do not immediately materialise the symmetric concurrence scores; we instead do so at the end, so as to forego duplication of intermediary processing.
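A minimal sketch of the first of these two passes, assuming (s, p, o, c) tuples sorted in natural order, a precomputed map aac of adjusted average cardinalities, and the class-size threshold discussed later in this section; the analogous pass over data sorted by (o, p, s, c) would emit '+'-marked tuples weighted by AAIC.

from itertools import groupby, combinations

def generate_concurrence_tuples(quads_sorted_spoc, aac, max_class_size):
    """For each (s, p) pair, emit one raw concurrence sextuple per unordered
    pair of distinct objects sharing that pair, skipping over-large classes."""
    for (s, p), group in groupby(quads_sorted_spoc, key=lambda q: (q[0], q[1])):
        objects = sorted({o for (_, _, o, _) in group})  # concurrence class O_ps
        if len(objects) < 2 or len(objects) > max_class_size:
            continue
        score = 1.0 / (len(objects) * aac[p])  # C(p, s)
        for o_i, o_j in combinations(objects, 2):  # pairs with o_i < o_j
            yield (o_i, o_j, score, p, s, '-')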

Once generated, we can sort the two files of tuples by their natural order and perform a merge-join on the first two elements: generalising the directional o_i/s_i to simply e_i, each (e_i, e_j) pair denotes two entities that share some predicate-value pairs in common, where we can scan the sorted data and aggregate the final concurrence measure for each (e_i, e_j) pair using the information encoded in the respective tuples. We can thus generate (trivially sorted) tuples of the form (e_i, e_j, s), where s denotes the final aggregated concurrence score computed for the two entities; optionally, we can also write the symmetric concurrence tuples (e_j, e_i, s), which can be sorted separately as required.
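A sketch of this aggregation step, assuming the '+' and '−' files have already been merged into a single stream sorted on the leading entity pair; aggregate_pair stands in for the paper's ACS-style combination of the per-tuple coefficients and is not defined here.

from itertools import groupby

def aggregate_concurrence(sextuples_sorted, aggregate_pair):
    """Scan sextuples sorted on (e_i, e_j); combine all evidence for each
    entity pair into a single concurrence score."""
    for (e_i, e_j), group in groupby(sextuples_sorted, key=lambda t: (t[0], t[1])):
        # each tuple carries (coefficient, predicate, shared term, direction)
        evidence = [(coeff, p, term, direction)
                    for (_, _, coeff, p, term, direction) in group]
        yield (e_i, e_j, aggregate_pair(evidence))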

Note that the number of tuples generated is quadratic with respect to the size of the respective concurrence class, which becomes a major impediment for scalability given the presence of large such sets; for example, consider a corpus containing 1 million persons sharing the value "female" for the property foaf:gender, where we would have to generate 10^6 × (10^6 − 1)/2 ≈ 500 billion non-reflexive, non-symmetric concurrence tuples. However, we can leverage the fact that such sets can only invoke a minor influence on the final concurrence of their elements, given that the magnitude of the set (e.g., |S_po|) is a factor in the denominator of the computed C(p, o) score, such that C(p, o) ≤ 1/|S_po|. Thus, in practice, we implement a maximum-size threshold for the S_po and O_ps concurrence classes; this threshold is selected based on a practical upper limit for raw similarity tuples to be generated, where the appropriate maximum class size can trivially be determined alongside the derivation of the predicate statistics. For the purpose of evaluation, we choose to keep the number of raw tuples generated at around 1 billion, and so set the maximum concurrence class size at 38; we will see more in Section 6.4.

6.2. Distributed implementation

Given the previous discussion, our distributed implementation is fairly straightforward, as follows:

(i) Coordinate: the slave machines split their segment of the corpus according to a modulo-hash function on the subject position of the data, sort the segments, and send the split segments to the peer determined by the hash function (a sketch of this hash placement follows the list); the slaves simultaneously gather incoming sorted segments and subsequently perform a merge-sort of the segments.

(ii) Coordinate: the slave machines apply the same operation, this time hashing on object (triples with rdf:type as predicate are not included in the object-hashing); subsequently, the slaves merge-sort the segments ordered by object.

(iii) Run: the slave machines then extract predicate-level statistics, and statistics relating to the concurrence-class-size distribution, which are used to decide upon the class-size threshold.

(iv) Gather/flood/run: the master machine gathers and aggregates the high-level statistics generated by the slave machines in the previous step, and sends a copy of the global statistics back to each machine; the slaves subsequently generate the raw concurrence-encoding sextuples as described before, from a scan of the data in both orders.

(v) Coordinate: the slave machines coordinate the locally generated sextuples according to the first element (join position), as before.

(vi) Run: the slave machines aggregate the sextuples coordinated in the previous step, and produce the final non-symmetric concurrence tuples.

(vii) Run: the slave machines produce the symmetric version of the concurrence tuples, and coordinate and sort on the first element.
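As referenced in step (i), a minimal sketch of the modulo-hash placement used by the coordinate steps; the choice of MD5 is our assumption, as the hash function is not specified in the text.

import hashlib

def peer_for_term(term: str, num_slaves: int) -> int:
    """Modulo-hash placement: all quads sharing the same subject (or object)
    land on the same slave, so a local merge-join sees complete groups."""
    digest = hashlib.md5(term.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % num_slaves

def split_segment(quads, num_slaves, position=0):
    """Partition a local segment by hashing on the term at `position`
    (0 = subject, 2 = object); each bucket is then sorted and scattered."""
    buckets = [[] for _ in range(num_slaves)]
    for quad in quads:
        buckets[peer_for_term(quad[position], num_slaves)].append(quad)
    return [sorted(bucket) for bucket in buckets]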

Table 12
Breakdown of timing of distributed concurrence analysis. Each cell gives minutes and, in parentheses, the percentage of total execution time.

Category                              | 1             | 2             | 4             | 8            | Full-8
Master
  Total execution time                | 516.2 (100.0) | 262.7 (100.0) | 137.8 (100.0) | 73.0 (100.0) | 835.4 (100.0)
  Executing                           | 2.2 (0.4)     | 2.3 (0.9)     | 3.1 (2.3)     | 3.4 (4.6)    | 2.1 (0.3)
    Miscellaneous                     | 2.2 (0.4)     | 2.3 (0.9)     | 3.1 (2.3)     | 3.4 (4.6)    | 2.1 (0.3)
  Idle (waiting for slaves)           | 514.0 (99.6)  | 260.4 (99.1)  | 134.6 (97.7)  | 69.6 (95.4)  | 833.3 (99.7)
Slave
  Avg. executing (total exc. idle)    | 514.0 (99.6)  | 257.8 (98.1)  | 131.1 (95.1)  | 66.4 (91.0)  | 786.6 (94.2)
    Split/sort/scatter (subject)      | 119.6 (23.2)  | 59.3 (22.6)   | 29.9 (21.7)   | 14.9 (20.4)  | 175.9 (21.1)
    Merge-sort (subject)              | 42.1 (8.2)    | 20.7 (7.9)    | 11.2 (8.1)    | 5.7 (7.9)    | 62.4 (7.5)
    Split/sort/scatter (object)       | 115.4 (22.4)  | 56.5 (21.5)   | 28.8 (20.9)   | 14.3 (19.6)  | 164.0 (19.6)
    Merge-sort (object)               | 37.2 (7.2)    | 18.7 (7.1)    | 9.6 (7.0)     | 5.1 (7.0)    | 55.6 (6.6)
    Extract high-level statistics     | 20.8 (4.0)    | 10.1 (3.8)    | 5.1 (3.7)     | 2.7 (3.7)    | 28.4 (3.3)
    Generate raw concurrence tuples   | 45.4 (8.8)    | 23.3 (8.9)    | 11.6 (8.4)    | 5.9 (8.0)    | 67.7 (8.1)
    Coordinate/sort concurrence tuples| 97.6 (18.9)   | 50.1 (19.1)   | 25.0 (18.1)   | 12.5 (17.2)  | 166.0 (19.9)
    Merge-sort/aggregate similarity   | 36.0 (7.0)    | 19.2 (7.3)    | 10.0 (7.2)    | 5.3 (7.2)    | 66.6 (8.0)
  Avg. idle                           | 2.2 (0.4)     | 4.9 (1.9)     | 6.7 (4.9)     | 6.5 (9.0)    | 48.8 (5.8)
    Waiting for peers                 | 0.0 (0.0)     | 2.6 (1.0)     | 3.6 (2.6)     | 3.2 (4.3)    | 46.7 (5.6)
    Waiting for master                | 2.2 (0.4)     | 2.3 (0.9)     | 3.1 (2.3)     | 3.4 (4.6)    | 2.1 (0.3)

Here, we make heavy use of the coordinate function to align data according to the join position required for the subsequent processing step; in particular, aligning the raw data by subject and object, and then the concurrence tuples analogously.

Note that we do not hash on the object position of rdf:type triples: our raw corpus contains 206.8 million such triples and, given the distribution of class memberships, we assume that hashing these values will lead to an uneven distribution of data and subsequently uneven load balancing. For example, 79.2% of all class memberships are for foaf:Person, hashing on which would send 163.7 million triples to one machine, which alone is greater than the average number of triples we would expect per machine (139.8 million). In any case, given that our corpus contains 105 thousand unique values for rdf:type, we would expect the average inverse-cardinality to be approximately 1,970; even for classes with two members, the potential effect on concurrence is negligible.

6.3. Performance evaluation

We apply our concurrence analysis over the consolidated corpus derived in Section 5. The total time taken was 13.9 h. Sorting, splitting and scattering the data according to subject on the slave machines took 3.06 h, with an average idle time of 7.7 min (4.2%). Subsequently, merge-sorting the sorted segments took 1.13 h, with an average idle time of 5.4 min (8%). Analogously, sorting, splitting and scattering the non-rdf:type statements by object took 2.93 h, with an average idle time of 11.8 min (6.7%). Merge-sorting the data by object took 0.99 h, with an average idle time of 3.8 min (6.3%). Extracting the predicate statistics and threshold information from data sorted in both orders took 29 min, with an average idle time of 0.6 min (2.1%). Generating the raw, unsorted similarity tuples took 69.8 min, with an average idle time of 2.1 min (3%). Sorting and coordinating the raw similarity tuples across the machines took 180.1 min, with an average idle time of 14.1 min (7.8%). Aggregating the final similarity took 67.8 min, with an average idle time of 1.2 min (1.8%).

Table 12 presents a breakdown of the timing of the task. First, with regards to application over the full corpus, although this task requires some aggregation of global knowledge by the master machine, the volume of data involved is minimal: a total of 2.1 minutes is spent on the master machine performing various minor tasks (initialisation, remote calls, logging, aggregation and broadcast of statistics). Thus, 99.7% of the task is performed in parallel on the slave machines. Although there is less time spent waiting for the master machine compared to the previous two tasks, deriving the concurrence measures involves three expensive sort/coordinate/merge-sort operations to redistribute and sort the data over the slave swarm. The slave machines were idle for, on average, 5.8% of the total task time; most of this idle time (99.6%) was spent waiting for peers. The master machine was idle for almost the entire task, with 99.7% spent waiting for the slave machines to finish their tasks; again, interleaving a job for another task would have practical benefits.

Second, with regards to the timing of tasks when varying the number of slave machines for 100 million quadruples, we see that the percentage of time spent idle on the slave machines increases from 0.4% (1) to 9% (8). However, for each incremental doubling of the number of slave machines, the total overall task times are reduced by factors of 0.509, 0.525 and 0.517, respectively.

6.4. Results evaluation

With respect to data distribution, after hashing on subject we observed an average absolute deviation (average distance from the mean) of 176 thousand triples across the slave machines, representing an average 0.13% deviation from the mean: near-optimal data distribution. After hashing on the object of non-rdf:type triples, we observed an average absolute deviation of 1.29 million triples across the machines, representing an average 1.1% deviation from the mean; in particular, we note that one machine was assigned 3.7 million triples above the mean (an additional 3.3% above the mean). Although not optimal, the percentage of data deviation given by hashing on object is still within the natural variation in run-times we have seen for the slave machines during most parallel tasks.

In Fig. 9a and b, we illustrate the effect of including increasingly large concurrence classes on the number of raw concurrence tuples generated. For the predicate–object pairs, we observe a power-law(-esque) relationship between the size of the concurrence class and the number of such classes observed. Second, we observe that the number of concurrences generated for each increasing class size initially remains fairly static; i.e., larger class sizes give quadratically more concurrences but occur polynomially less often, until the point where the largest classes, which generally only have one occurrence, are reached, and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

[Fig. 9: two log/log plots, (a) for predicate–object pairs and (b) for predicate–subject pairs, plotting counts (y-axis) against concurrence class size (x-axis); each panel shows the number of classes, the concurrence output size, the cumulative concurrence output size and the chosen cut-off.]

Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs and predicate–subject pairs, respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i × (s_i² − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

  Predicate          | Objects     | Triples     | AAC
1 foaf:nick          | 150,433,035 | 150,437,864 | 1.000
2 lldpubmed:journal  | 6,790,426   | 6,795,285   | 1.003
3 rdf:value          | 2,802,998   | 2,803,021   | 1.005
4 eurostat:geo       | 2,642,034   | 2,642,034   | 1.005
5 eurostat:time      | 2,641,532   | 2,641,532   | 1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

  Predicate                  | Subjects  | Triples   | AAIC
1 lldpubmed:meshHeading      | 2,121,384 | 2,121,384 | 1.009
2 opiumfield:recommendation  | 1,920,992 | 1,920,992 | 1.010
3 fb:type.object.key         | 1,108,187 | 1,108,187 | 1.017
4 foaf:page                  | 1,702,704 | 1,712,970 | 1.017
5 skipinions:hasFeature      | 1,010,604 | 1,010,604 | 1.019



For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 1.9 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ""). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1,000, 2,000), we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1,000 (versus 88 concurrence classes at size 996); the 1,000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1,000 values for foaf:knows.30 Also, for example, there were 55 thousand classes of size 2,000 (versus 6 classes of size 1,999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property.31 We also encountered unusually large numbers of classes approximating these milestones, such as 73 at size 2,001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

30 cf. http://api.hi5.com/rest/profile/foaf/100614697
31 cf. http://bio2rdf.org/mesh:D000123Q000235

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).
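A sketch of how such a cut-off can be derived from the class-size histogram gathered alongside the predicate statistics; the function and its arguments are illustrative rather than taken from the paper.

def choose_class_size_threshold(class_histogram, tuple_budget):
    """Given {class_size: number_of_classes}, pick the largest cut-off such
    that the cumulative number of raw pair tuples stays within the budget.
    A class of size s yields s * (s - 1) / 2 non-reflexive, non-symmetric pairs."""
    cumulative, threshold = 0, 1
    for size in sorted(class_histogram):
        produced = class_histogram[size] * size * (size - 1) // 2
        if cumulative + produced > tuple_budget:
            break
        cumulative += produced
        threshold = size
    return threshold, cumulative

# e.g., a budget of roughly 1 billion tuples over the observed distribution:
# threshold, total = choose_class_size_threshold(histogram, 1_000_000_000)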

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for these predicates). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs:

Table 15
Top five concurrent entities and the number of pairs they share.

  Entity label 1 | Entity label 2 | Concur.
1 New York City  | New York State | 791
2 London         | England        | 894
3 Tokyo          | Japan          | 900
4 Toronto        | Ontario        | 418
5 Philadelphia   | Pennsylvania   | 217

Table 16
Breakdown of concurrences for top five ranked entities in SWSE, ordered by rank, with, respectively: entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

  Ranked entity     | Con.  | "Closest" entity  | Val.
1 Tim Berners-Lee   | 908   | Lalana Kagal      | 0.83
2 Dan Brickley      | 2,552 | Libby Miller      | 0.94
3 update.status.net | 11    | socialnetwork.ro  | 0.45
4 FOAF-a-matic      | 21    | foaf.me           | 0.23
5 Evan Prodromou    | 3,367 | Stav Prodromou    | 0.89


although fewer inter-PLD concurrences are found, they typically have higher concurrences.32

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on Wikipedia (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both given as the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top five ranked entities; for brevity, we again show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, co-authoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading: for the purposes of illustration, we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence

37 http://ontologydesignpatterns.org/cp/owl/fsdas/

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which are not in the previous consolidation step, we output sextuples of the form

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

?p a owl:FunctionalProperty .
?x ?p ?l1 , ?l2 .
?l1 owl:differentFrom ?l2 .
⇒ false


where the first pattern is the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here, we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
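The relaxed comparison can be summarised in a few lines; the following sketch is ours, with illustrative names, and approximates OWL value equality by structural equality of the literals.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Literal:
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None

def values_clash(l1: Literal, l2: Literal) -> bool:
    """Relaxed functional-property check: flag an inconsistency only if the two
    literals differ both as (typed) values and on their lower-cased lexical
    forms, ignoring language tags and datatypes. For brevity, typed-value
    comparison is approximated here by structural equality."""
    if (l1.lexical, l1.lang, l1.datatype) == (l2.lexical, l2.lang, l2.datatype):
        return False                      # same literal: no clash
    if l1.lexical.lower() == l2.lexical.lower():
        return False                      # e.g., "male" vs "Male"@en: tolerated
    return True                           # e.g., "3.546" vs "2.97": flagged

# ex:Ted's gender values "male" and "Male"@en would not be flagged, whereas the
# two dpo:length values of the consolidated Fiat resources would be.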

With respect to these consistency-checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola"40, who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.


With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
made owl:propertyDisjointWith maker .

Assertional [fictional]:
ex:AZ″ maker ex:entcons′ .
dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
    (a) minimum distance in the non-transitive equivalence class;
    (b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject (loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics); c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C_1, ..., C_n}, C ⊆ 2^(U∪B), such that each C_i ∈ C, C_i ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward, greedy, consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq_0, each containing an individual identifier in the equivalence class. Now, let Ω(E_i, E_j) denote the aggregated weight of the edge considering the merge of the nodes of E_i and the nodes of E_j in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in E_i and all nodes in E_j, and such that c denotes the concurrence score considering the merge of entities in E_i and E_j; intuitively, the same weight as before, but applied as if the identifiers in E_i and E_j were consolidated in the graph. We can apply the following clustering:

– for each pair of sets E_i, E_j ∈ Eq_n such that there is no pair {a, b} ∈ C with a ∈ E_i, b ∈ E_j (i.e., consistently mergeable subsets), identify the weights Ω(E_i, E_j) and order the pairings;

– in descending order with respect to the above weights, merge E_i, E_j pairs (such that neither E_i nor E_j have already been merged in this iteration), producing Eq_{n+1} at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence classes by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
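A compact sketch of this greedy consistent clustering, under the stated simplifying assumptions; edge_weight stands in for Ω and is expected to return the lexicographically comparable (d, c) pair, or None where no evidence links the two clusters. The names are illustrative, not the paper's.

def repair_equivalence_class(identifiers, incompatible, edge_weight):
    """Greedy consistent clustering: start from singleton clusters and, in
    descending order of merged edge weight, merge cluster pairs that would
    not bring two incompatible identifiers into the same cluster.
    `incompatible` is a set of frozensets of size two."""
    clusters = [frozenset([i]) for i in identifiers]
    changed = True
    while changed:
        changed = False
        candidates = []
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                ci, cj = clusters[x], clusters[y]
                if any(frozenset([a, b]) in incompatible for a in ci for b in cj):
                    continue                      # would recreate an inconsistency
                weight = edge_weight(ci, cj)
                if weight is not None:
                    candidates.append((weight, x, y))
        merged = set()
        for _, x, y in sorted(candidates, reverse=True):
            if x in merged or y in merged:
                continue                          # each cluster merged at most once per round
            clusters.append(clusters[x] | clusters[y])
            merged.update((x, y))
            changed = True
        clusters = [c for k, c in enumerate(clusters) if k not in merged]
    return clusters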

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

Table 17
Breakdown of timing of distributed disambiguation and repair. Each cell gives minutes and, in parentheses, the percentage of total execution time; cells not recoverable from the source are marked "–".

Category                                 | 1             | 2            | 4            | 8            | Full-8
Master
  Total execution time                   | 114.7 (100.0) | 61.3 (100.0) | 36.7 (100.0) | 23.8 (100.0) | 141.1 (100.0)
  Executing                              | –             | –            | 0.1 (0.1)    | 0.1 (0.1)    | 0.0 (0.0)
    Miscellaneous                        | –             | –            | –            | –            | –
  Idle (waiting for slaves)              | 114.7 (100.0) | 61.2 (100.0) | 36.6 (99.9)  | 23.8 (99.9)  | 141.1 (100.0)
Slave
  Avg. executing (total exc. idle)       | 114.7 (100.0) | 59.0 (96.3)  | 33.6 (91.5)  | 22.3 (93.7)  | 130.9 (92.8)
    Identify inconsistencies and repairs | 74.7 (65.1)   | 38.5 (62.8)  | 23.2 (63.4)  | 17.2 (72.2)  | 72.4 (51.3)
    Repair corpus                        | 40.0 (34.8)   | 20.5 (33.5)  | 10.3 (28.1)  | 5.1 (21.5)   | 58.5 (41.5)
  Avg. idle                              | 0.0 (0.0)     | 2.3 (3.7)    | 3.1 (8.5)    | 1.5 (6.3)    | 10.2 (7.2)
    Waiting for peers                    | 0.0 (0.0)     | 2.2 (3.7)    | 3.1 (8.4)    | 1.5 (6.2)    | 10.1 (7.2)
    Waiting for master                   | –             | –            | –            | –            | 0.1 (0.1)

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards to the full corpus, note that the aggregation of the repair information took a negligible amount of time, and where only a total of one minute is spent on the slave machine. Most notably, load balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards to the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

Table 20
Results of manual repair inspection.

Manual    | Auto      | Count | %
SAME      | /         | 223   | 44.3
DIFFERENT | /         | 261   | 51.9
UNCLEAR   | –         | 19    | 3.8
/         | SAME      | 361   | 74.6
/         | DIFFERENT | 123   | 25.4
SAME      | SAME      | 205   | 91.9
SAME      | DIFFERENT | 18    | 8.1
DIFFERENT | DIFFERENT | 105   | 40.2
DIFFERENT | SAME      | 156   | 59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

  Functional property               | Detections
1 foaf:gender                       | 60
2 foaf:age                          | 32
3 dbo:height, dbo:length            | 4
4 dbo:wheelbase, dbo:width          | 3
5 atomowl:body                      | 1
6 loc:address, loc:name, loc:phone  | 1

Table 19
Breakdown of inconsistency detections for disjoint classes.

  Disjoint class 1   | Disjoint class 2 | Detections
1 foaf:Document      | foaf:Person      | 163
2 foaf:Document      | foaf:Agent       | 153
3 foaf:Organization  | foaf:Person      | 17
4 foaf:Person        | foaf:Project     | 3
5 foaf:Organization  | foaf:Project     | 1

[Fig. 10: log/log plot of the number of repaired equivalence classes (y-axis) against the number of partitions they are broken into (x-axis).]
Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

[Fig. 11: log/log plot of the number of partitions (y-axis) against the number of identifiers per partition (x-axis).]
Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).

44 We were particularly careful to distinguish information resources (such as Wikipedia articles) and non-information resources as being DIFFERENT.


It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that, between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here, we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
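These figures follow directly from the counts in Table 20; for clarity, the derivation is:

\[
P_{\mathrm{same}} = \frac{205}{361} \approx 0.568, \qquad
R_{\mathrm{same}} = \frac{205}{223} \approx 0.919, \qquad
F_1 = \frac{2 \cdot 0.568 \cdot 0.919}{0.568 + 0.919} \approx 0.702
\]
\[
P_{\mathrm{diff}} = \frac{105}{123} \approx 0.854, \qquad
R_{\mathrm{diff}} = \frac{105}{261} \approx 0.402, \qquad
F_1 = \frac{2 \cdot 0.854 \cdot 0.402}{0.854 + 0.402} \approx 0.546
\]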

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our blacklist). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45 Of course, in OWL (2) DL, datatype properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache hit-rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen, for our corpus and machine count, the complexity of repairing consolidated batches may become an issue, given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what; the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching; for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed.

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system performs pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
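The canonicalisation optimisation referred to here can be sketched with a union-find structure (an illustrative example only, not BigOWLIM's or our exact algorithm): each owl:sameAs equivalence class is represented by a single canonical term, here simply the lexically least identifier, so that quadratically many pairwise owl:sameAs triples need not be materialised.

    class UnionFind:
        """Track owl:sameAs equivalence classes and elect one canonical term."""
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, x, y):
            rx, ry = self.find(x), self.find(y)
            if rx != ry:
                # elect the lexically lowest term as canonical (one simple, stable choice)
                lo, hi = sorted((rx, ry))
                self.parent[hi] = lo

    # toy usage with hypothetical identifiers
    uf = UnionFind()
    same_as = [("ex:TimBL", "ex:timbl"), ("ex:timbl", "dbpedia:Tim_Berners-Lee")]
    for a, b in same_as:
        uf.union(a, b)

    triples = [("ex:timbl", "foaf:name", '"Tim Berners-Lee"')]
    canonicalised = [(uf.find(s), p, o if o.startswith('"') else uf.find(o))
                     for s, p, o in triples]
    print(canonicalised)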

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris


Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
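To illustrate the flavour of such candidate selection (a simplified sketch under our own assumptions, not Song and Heflin's implementation), instances can be blocked on the values of a discriminating property so that candidate pairs are only generated within a block:

    from collections import defaultdict
    from itertools import combinations

    def candidate_pairs(instances, discriminating_props):
        """Group instances by values of discriminating properties and yield
        candidate pairs only within a block, avoiding full pairwise comparison."""
        blocks = defaultdict(set)
        for inst, props in instances.items():
            for p in discriminating_props:
                for value in props.get(p, ()):
                    blocks[(p, value)].add(inst)
        seen = set()
        for members in blocks.values():
            for pair in combinations(sorted(members), 2):
                if pair not in seen:
                    seen.add(pair)
                    yield pair

    # toy usage with hypothetical instances
    instances = {
        "ex:a1": {"foaf:mbox": {"mailto:alice@example.org"}},
        "ex:a2": {"foaf:mbox": {"mailto:alice@example.org"}},
        "ex:b1": {"foaf:mbox": {"mailto:bob@example.org"}},
    }
    print(list(candidate_pairs(instances, ["foaf:mbox"])))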

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
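The general LSH idea can be sketched as follows (an illustrative MinHash-based example, not the RDFSim implementation): token sets of the virtual documents are hashed into band signatures, resources colliding in any band become candidates, and the Jaccard coefficient is then computed exactly for those candidates.

    import hashlib
    from collections import defaultdict
    from itertools import combinations

    def minhash(tokens, num_hashes=32):
        """MinHash signature: for each seed, the minimum hash over the token set."""
        sig = []
        for seed in range(num_hashes):
            sig.append(min(
                int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
                for t in tokens))
        return sig

    def lsh_candidates(docs, num_hashes=32, bands=8):
        """Bucket resources whose signatures collide in at least one band."""
        rows = num_hashes // bands
        buckets = defaultdict(set)
        for name, tokens in docs.items():
            sig = minhash(tokens, num_hashes)
            for b in range(bands):
                key = (b, tuple(sig[b * rows:(b + 1) * rows]))
                buckets[key].add(name)
        pairs = set()
        for members in buckets.values():
            pairs.update(combinations(sorted(members), 2))
        return pairs

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    # toy usage with hypothetical "virtual documents"
    docs = {
        "ex:paris_fr": {"paris", "france", "capital", "seine"},
        "ex:paris_city": {"paris", "capital", "france", "city"},
        "ex:paris_hilton": {"paris", "hilton", "celebrity"},
    }
    for x, y in lsh_candidates(docs):
        print(x, y, round(jaccard(docs[x], docs[y]), 2))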

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion.


We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Table A1
Prefixes used throughout the paper (each prefix maps to its corresponding namespace URI). "T-Box prefixes": atomowl, b2r, b2rr, dbo, dbp, ecs, eurostat, fb, foaf, geonames, kwa, lldpubmed, loc, mo, mvcb, opiumfield, owl, quaffing, rdf, rdfs, skipinions, skos. "A-Box prefixes": avtimbl, dbpedia, dblpperson, eswc2006p, identicau, kingdoms, macs, semweborg, swid, timblfoaf, vperson, wikier, yagor.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/.

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OKKAM: towards a solution to the "identity crisis" on the Semantic Web, in:


Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/.

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/.

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in linked data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/.

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

Table 13Top five predicates with respect to lowest adjusted average cardinality (AAC)

Predicate Objects Triples AAC

1 foafnick 150433035 150437864 10002 lldpubmedjournal 6790426 6795285 10033 rdfvalue 2802998 2803021 10054 eurostatgeo 2642034 2642034 10055 eurostattime 2641532 2641532 1005

Table 14Top five predicates with respect to lowest adjusted average inverse-cardinality(AAIC)

Predicate Subjects Triples AAIC

1 lldpubmedmeshHeading 2121384 2121384 10092 opiumfieldrecommendation 1920992 1920992 10103 fbtypeobjectkey 1108187 1108187 10174 foafpage 1702704 1712970 10175 skipinionshasFeature 1010604 1010604 1019

98 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

quadratically more concurrences but occur polynomially less of-tenmdashuntil the point where the largest classes that generally onlyhave one occurrence is reached and the number of concurrencesbegins to increase quadratically Also shown is the cumulativecount of concurrence tuples generated for increasing class sizeswhere we initially see a powerndashlaw(-esque) correlation whichsubsequently begins to flatten as the larger concurrence classes be-come more sparse (although more massive)

For the predicatendashsubject pairs the same roughly holds truealthough we see fewer of the very largest concurrence classesthe largest concurrence class given by a predicatendashsubject pairwas 79 thousand versus 19 million for the largest predicatendashob-ject pair respectively given by the pairs (kwamap macsman-ual_rameau_lcsh) and (opiumfieldrating ) Also weobserve some lsquolsquonoisersquorsquo where for milestone concurrence class sizes(esp at 50 100 1000 2000) we observe an unusual amount ofclasses For example there were 72 thousand concurrence classesof precisely size 1000 (versus 88 concurrence classes at size996)mdashthe 1000 limit was due to a FOAF exporter from the hi5comdomain which seemingly enforces that limit on the total lsquolsquofriendscountrsquorsquo of users translating into many users with precisely 1000values for foafknows30 Also for example there were 55 thousandclasses of size 2000 (versus 6 classes of size 1999)mdashalmost all ofthese were due to an exporter from the bio2rdforg domain whichputs this limit on values for the b2rlinkedToFrom property31 Wealso encountered unusually large numbers of classes approximatingthese milestones such as 73 at 2001 Such phenomena explain thestaggered lsquolsquospikesrsquorsquoand lsquolsquodiscontinuitiesrsquorsquo in Fig 9b which can be ob-served to correlate with such milestone values (in fact similar butless noticeable spikes are also present in Fig 9a)

These figures allow us to choose a threshold of concurrence-class size given an upper bound on raw concurrence tuples togenerate For the purposes of evaluation we choose to keep thenumber of materialised concurrence tuples at around 1 billionwhich limits our maximum concurrence class size to 38 (fromwhich we produce 1024 billion tuples 721 million through shared(po) pairs and 303 million through (ps) pairs)

With respect to the statistics of predicates for the predicatendashsubject pairs each predicate had an average of 25229 unique ob-jects for 37953 total triples giving an average cardinality of15 We give the five predicates observed to have the lowest ad-

30 cf httpapihi5comrestprofilefoaf10061469731 cf httpbio2rdforgmeshD000123Q000235

justed average cardinality in Table 13 These predicates are judgedby the algorithm to be the most selective for identifying their sub-ject with a given object value (ie quasi-inverse-functional) notethat the bottom two predicates will not generate any concurrencescores since they are perfectly unique to a given object (ie thenumber of triples equals the number of objects such that no twoentities can share a value for this predicate) For the predicatendashob-ject pairs there was an average of 11572 subjects for 20532 tri-ples giving an average inverse-cardinality of 264 Weanalogously give the five predicates observed to have the lowestadjusted average inverse cardinality in Table 14 These predicatesare judged to be the most selective for identifying their object witha given subject value (ie quasi-functional) however four of thesepredicates will not generate any concurrences since they are per-fectly unique to a given subject (ie those where the number of tri-ples equals the number of subjects)

Aggregation produced a final total of 6369 million weightedconcurrence pairs with a mean concurrence weight of 00159Of these pairs 195 million involved a pair of identifiers from dif-ferent PLDs (31) whereas 6174 million involved identifiers fromthe same PLD however the average concurrence value for an in-tra-PLD pair was 0446 versus 0002 for inter-PLD pairsmdashalthough

Table 15Top five concurrent entities and the number of pairs they share

Entity label 1 Entity label 2 Concur

1 New York City New York State 7912 London England 8943 Tokyo Japan 9004 Toronto Ontario 4185 Philadelphia Pennsylvania 217

Table 16Breakdown of concurrences for top five ranked entities in SWSE ordered by rankwith respectively entity label number of concurrent entities found the label of theconcurrent entity with the largest degree and finally the degree value

Ranked entity Con lsquolsquoClosestrsquorsquo entity Val

1 Tim Berners-Lee 908 Lalana Kagal 0832 Dan Brickley 2552 Libby Miller 0943 updatestatusnet 11 socialnetworkro 0454 FOAF-a-matic 21 foafme 0235 Evan Prodromou 3367 Stav Prodromou 089

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 99

fewer intra-PLD concurrences are found they typically have higherconcurrences32

In Table 15 we give the labels of top five most concurrent enti-ties including the number of pairs they sharemdashthe concurrencescore for each of these pairs was gt09999999 We note that theyare all locations where particularly on WIKIPEDIA (and thus filteringthrough to DBpedia) properties with location values are typicallyduplicated (eg dbpdeathPlace dbpbirthPlace dbphead-quartersmdashproperties that are quasi-functional) for exampleNew York City and New York State are both the dbpdeath-

Place of dbpediaIsacc_Asimov etcIn Table 16 we give a description of the concurrent entities

found for the top-five ranked entitiesmdashfor brevity again we showentity labels In particular we note that a large amount of concur-rent entities are identified for the highly-ranked persons With re-spect to the strongest concurrences (i) Tim and his former studentLalana share twelve primarily academic links coauthoring six pa-pers (ii) Dan and Libby co-founders of the FOAF project share87 links primarily 73 foafknows relations to and from the samepeople as well as a co-authored paper occupying the same profes-sional positions etc33 (iii) updatestatusnet and socialnet-

workro share a single foafaccountServiceHomepage linkfrom a common user (iv) similarly the FOAF-a-matic and foafme

services share a single mvcbgeneratorAgent inlink (v) finallyEvan and Stav share 69 foafknows inlinks and outlinks exportedfrom the identica service

34 We instead refer the interested reader to [9] for some previous works on the topic

7 Entity disambiguation and repair

We have already seen thatmdasheven by only exploiting the formallogical consequences of the data through reasoningmdashconsolidationmay already be imprecise due to various forms of noise inherent inLinked Data In this section we look at trying to detect erroneouscoreferences as produced by the extended consolidation approachintroduced in Section 5 Ideally we would like to automatically de-tect revise and repair such cases to improve the effective precisionof the consolidation process As discussed in Section 26 OWL 2 RLRDF contains rules for automatically detecting inconsistencies in

32 Note that we apply this analysis over the consolidated data and thus this is anapproximative reading for the purposes of illustration we extract the PLDs fromcanonical identifiers which are choosen based on arbitrary lexical ordering

33 Notably Leigh Dodds (creator of the FOAF-a-matic service) is linked by theproperty quaffing drankBeerWith to both

of general inconsistency repair for Linked Data35 We use syntactic shortcuts in our file to denote when s = s0 andor o = o0

Maintaining the additional rewrite information during the consolidation process istrivial where the output of consolidating subjects gives quintuples (spocs0) whichare then sorted and consolidated by o to produce the given sextuples

36 httpmodelsokkamorgENS-core-vocabularycountry_of_residence37 httpontologydesignpatternsorgcpowlfsdas

RDF data representing formal contradictions according to OWLsemantics Where such contradictions are created due to consoli-dation we believe this to be a good indicator of erroneousconsolidation

Once erroneous equivalences have been detected we wouldlike to subsequently diagnose and repair the coreferences involvedOne option would be to completely disband the entire equivalenceclass however there may often be only one problematic identifierthat eg causes inconsistency in a large equivalence classmdashbreak-ing up all equivalences would be a coarse solution Instead hereinwe propose a more fine-grained method for repairing equivalenceclasses that incrementally rebuilds coreferences in the set in amanner that preserves consistency and that is based on the originalevidences for the equivalences found as well as concurrence scores(discussed in the previous section) to indicate how similar the ori-ginal entities are Once the erroneous equivalence classes havebeen repaired we can then revise the consolidated corpus to reflectthe changes

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35
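To make the encoding concrete, the following is a minimal Python sketch (not the authors' code) of how a consolidated quadruple might be turned into such a sextuple; footnote 35 mentions syntactic shortcuts for the cases s = s′ and/or o = o′, and here an empty-string marker stands in for such a shortcut purely as an assumed illustration.

    # Hypothetical sketch: encode a quad plus its pre-consolidation identifiers
    # as a sextuple (s, p, o, c, s', o'); SAME is an assumed shortcut marker.
    SAME = ""

    def to_sextuple(quad, canon):
        """quad = (s_orig, p, o_orig, c); canon maps identifiers to canonical forms."""
        s_orig, p, o_orig, c = quad
        s = canon.get(s_orig, s_orig)
        o = canon.get(o_orig, o_orig)
        return (s, p, o, c,
                SAME if s == s_orig else s_orig,
                SAME if o == o_orig else o_orig)

    # example with an illustrative canonicalisation map
    canon = {"ex:TimBL": "ex:timbl"}
    print(to_sextuple(("ex:TimBL", "foaf:knows", "ex:DanBri", "ex:doc1"), canon))
    # -> ('ex:timbl', 'foaf:knows', 'ex:DanBri', 'ex:doc1', 'ex:TimBL', '')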

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern (here, the owl:FunctionalProperty declaration). To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]
dbo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]
dbpedia:Fiat_Nuova_500′ dbo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]
dbpedia:Fiat_Nuova′ dbo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dbo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use "male"/"female" as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]
foaf:age a owl:FunctionalProperty .
foaf:gender a owl:FunctionalProperty .
foaf:birthday a owl:FunctionalProperty .
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
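The relaxed comparison just described might be sketched as follows; this is a rough Python illustration (not the authors' implementation), where a literal is assumed to be a simple (lexical form, language tag, datatype) tuple and full datatype-aware value comparison is elided.

    from collections import namedtuple

    Literal = namedtuple("Literal", ["lexical", "lang", "datatype"])

    def clashes(l1, l2):
        # (i) identical literals trivially share the same data value: no clash
        if (l1.lexical, l1.lang, l1.datatype) == (l2.lexical, l2.lang, l2.datatype):
            return False
        # (ii) relaxed check: equal lower-cased lexical forms (ignoring
        # language tag and datatype) are not flagged
        if l1.lexical.lower() == l2.lexical.lower():
            return False
        return True  # distinct values for a functional property: flag

    # ex:Ted's gender values above would not be flagged:
    print(clashes(Literal("male", None, None), Literal("Male", "en", None)))   # False
    # ...whereas the two distinct Fiat lengths would be:
    print(clashes(Literal("3.546", None, "xsd:double"),
                  Literal("2.97",  None, "xsd:double")))                        # True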

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type) and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw, unconsolidated corpus.


Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42
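To illustrate how such a detection is attributed to consolidation rather than to the raw data, the following is a rough Python sketch (hypothetical data structures, not the actual implementation): a clash between disjoint classes only counts as a consolidation symptom when the clashing memberships originate from different original identifiers of the same consolidated entity.

    from itertools import combinations

    def cax_dw_detections(types_by_entity, disjoint_pairs):
        """types_by_entity: canonical id -> list of (class, original_id);
           disjoint_pairs: set of frozensets {classA, classB} declared disjoint."""
        for entity, memberships in types_by_entity.items():
            for (cls1, orig1), (cls2, orig2) in combinations(memberships, 2):
                if frozenset((cls1, cls2)) in disjoint_pairs and orig1 != orig2:
                    # caused by consolidation: two originals were conflated
                    yield entity, (orig1, cls1), (orig2, cls2)

    disjoint = {frozenset(("foaf:Person", "foaf:Organization"))}
    types = {"w3c": [("foaf:Person", "identica:48404"),
                     ("foaf:Organization", "semweborg:w3c")]}
    for hit in cax_dw_detections(types, disjoint):
        print(hit)   # the incompatible pair identica:48404 / semweborg:w3c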

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived therefrom;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
made owl:propertyDisjointWith maker .

Assertional [fictional]
ex:AZ″ maker ex:entcons′ .
dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation, by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs, where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U ∪ B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that there exists no {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eqn+1 at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
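The following is a condensed Python sketch of this greedy clustering, written under simplifying assumptions (the aggregated weight Ω is approximated here by summing evidence counts and taking the maximum concurrence score, and the function and variable names are illustrative only; it is not the authors' implementation).

    from itertools import combinations

    def repair(identifiers, weights, incompatible):
        """identifiers: the coreferent identifiers in the equivalence class;
           weights: {frozenset({a, b}): (d, c)} for directly linked pairs;
           incompatible: set of frozenset({a, b}) pairs that must stay apart."""
        parts = [{i} for i in identifiers]          # start from singleton partitions

        def weight(p1, p2):
            # aggregated weight of merging two partitions (simplified Ω)
            pairs = [frozenset((a, b)) for a in p1 for b in p2]
            d = sum(weights.get(p, (0, 0.0))[0] for p in pairs)
            c = max((weights.get(p, (0, 0.0))[1] for p in pairs), default=0.0)
            return (d, c)

        def mergeable(p1, p2):
            return all(frozenset((a, b)) not in incompatible for a in p1 for b in p2)

        changed = True
        while changed:                               # iterate until fixpoint
            changed = False
            candidates = [(i, j) for i, j in combinations(range(len(parts)), 2)
                          if mergeable(parts[i], parts[j])
                          and weight(parts[i], parts[j]) > (0, 0.0)]
            candidates.sort(key=lambda ij: weight(parts[ij[0]], parts[ij[1]]),
                            reverse=True)            # strongest links first
            merged, removed = set(), set()
            for i, j in candidates:
                if i in merged or j in merged:       # each partition is merged at
                    continue                         # most once per iteration
                parts[i] |= parts[j]
                merged.update((i, j))
                removed.add(j)
                changed = True
            parts = [p for k, p in enumerate(parts) if k not in removed]
        return parts

    # 'a' and 'b' share strong evidence, but 'a' and 'c' are incompatible,
    # so the result is two partitions: {a, b} and {c}.
    print(repair(["a", "b", "c"],
                 {frozenset(("a", "b")): (3, 0.9), frozenset(("b", "c")): (1, 0.2)},
                 {frozenset(("a", "c"))}))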

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted streams of data and batch consolidated segments of data.
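As a rough illustration (in Python, assuming both input streams are already sorted on the canonical identifier as just described), the merge-join that collates each consolidated entity's outlinks and inlinks into a batch can be pictured as follows; the iterator inputs and names are hypothetical.

    from itertools import groupby
    import heapq

    def entity_batches(by_subject, by_object):
        """by_subject: sextuples sorted on the canonical subject (position 0);
           by_object:  sextuples sorted on the canonical object (position 2)."""
        subj_groups = groupby(by_subject, key=lambda t: t[0])
        obj_groups = groupby(by_object, key=lambda t: t[2])
        merged = heapq.merge(
            ((k, "out", list(g)) for k, g in subj_groups),
            ((k, "in", list(g)) for k, g in obj_groups),
            key=lambda x: x[0])
        # all data about one consolidated entity now arrives contiguously
        for canonical_id, group in groupby(merged, key=lambda x: x[0]):
            batch = {"out": [], "in": []}
            for _, direction, sextuples in group:
                batch[direction].extend(sextuples)
            yield canonical_id, batch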

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority queue to order the edge weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

Category                                        1               2               4               8               Full-8
                                                min      %      min      %      min      %      min      %      min      %
Master
  Total execution time                         114.7   100.0    61.3   100.0    36.7   100.0    23.8   100.0   141.1   100.0
  Executing (miscellaneous)                      0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.0
  Idle (waiting for slaves)                    114.7   100.0    61.2   100.0    36.6    99.9    23.8    99.9   141.1   100.0
Slave (avg.)
  Executing (total, excluding idle)            114.7   100.0    59.0    96.3    33.6    91.5    22.3    93.7   130.9    92.8
    Identify inconsistencies and repairs        74.7    65.1    38.5    62.8    23.2    63.4    17.2    72.2    72.4    51.3
    Repair corpus                               40.0    34.8    20.5    33.5    10.3    28.1     5.1    21.5    58.5    41.5
  Idle                                           0.0     0.0     2.3     3.7     3.1     8.5     1.5     6.3    10.2     7.2
    Waiting for peers                           0.0      0.0     2.2     3.7     3.1     8.4     1.5     6.2    10.1     7.2
    Waiting for master                          0.0      0.0     0.1     0.1     0.1     0.1     0.1     0.1     0.1     0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair; we must thus apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency.


Table 20
Results of manual repair inspection.

Manual        Auto          Count    %
SAME          /             223      44.3
DIFFERENT     /             261      51.9
UNCLEAR       –              19       3.8
/             SAME          361      74.6
/             DIFFERENT     123      25.4
SAME          SAME          205      91.9
SAME          DIFFERENT      18       8.1
DIFFERENT     DIFFERENT     105      40.2
DIFFERENT     SAME          156      59.7

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

     Functional property                    Detections
1    foaf:gender                            60
2    foaf:age                               32
3    dbo:height, dbo:length                  4
4    dbo:wheelbase, dbo:width                3
5    atomowl:body                            1
6    loc:address, loc:name, loc:phone        1

Table 19
Breakdown of inconsistency detections for disjoint classes.

     Disjoint class 1       Disjoint class 2     Detections
1    foaf:Document          foaf:Person          163
2    foaf:Document          foaf:Agent           153
3    foaf:Organization      foaf:Person           17
4    foaf:Person            foaf:Project           3
5    foaf:Organization      foaf:Project           1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).

44 We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.


Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (74.6%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
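For concreteness, these figures can be recomputed directly from the counts in Table 20; the short arithmetic sketch below (plain Python, illustrative only) shows the derivation for the "same" case.

    # 205 of the 361 pairs kept equivalent by the repair were manually SAME,
    # and 223 pairs were manually SAME in total (Table 20).
    tp, kept_equivalent, manually_same = 205, 361, 223

    precision = tp / kept_equivalent                     # 205/361 ~= 0.568
    recall = tp / manually_same                          # 205/223 ~= 0.919
    f1 = 2 * precision * recall / (precision + recall)   # ~= 0.702
    print(round(precision, 3), round(recall, 3), round(f1, 3))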

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person/Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n·log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].


With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck: for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen, for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets), we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.



Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/


The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers, and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences, and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers, even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style, string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data, and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPie system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset, or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris



Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space, where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>,54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity


Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching, and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion.


We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks, and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.


funded in part by Science Foundation Ireland under Grant No SFI08CEI1380 (Lion-2) and by an IRCSET postgraduate scholarship

Appendix A Prefixes

Table A1 lists the prefixes used throughout the paper

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on The Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


Fig. 9. Breakdown of potential concurrence classes given with respect to predicate–object pairs (a) and predicate–subject pairs (b), respectively, where for each class size on the x-axis (s_i) we show the number of classes (c_i), the raw non-reflexive, non-symmetric concurrence tuples generated (sp_i = c_i(s_i^2 − s_i)/2), a cumulative count of tuples generated for increasing class sizes (csp_i = Σ_{j≤i} sp_j), and our cut-off (s_i = 38) that we have chosen to keep the total number of tuples at 1 billion (log/log).

Table 13
Top five predicates with respect to lowest adjusted average cardinality (AAC).

     Predicate            Objects       Triples       AAC
  1  foaf:nick            150,433,035   150,437,864   1.000
  2  lldpubmed:journal    6,790,426     6,795,285     1.003
  3  rdf:value            2,802,998     2,803,021     1.005
  4  eurostat:geo         2,642,034     2,642,034     1.005
  5  eurostat:time        2,641,532     2,641,532     1.005

Table 14
Top five predicates with respect to lowest adjusted average inverse-cardinality (AAIC).

     Predicate                    Subjects    Triples     AAIC
  1  lldpubmed:meshHeading        2,121,384   2,121,384   1.009
  2  opiumfield:recommendation    1,920,992   1,920,992   1.010
  3  fb:type.object.key           1,108,187   1,108,187   1.017
  4  foaf:page                    1,702,704   1,712,970   1.017
  5  skipinions:hasFeature        1,010,604   1,010,604   1.019


quadratically more concurrences, but occur polynomially less often, until the point where the largest classes, which generally only have one occurrence each, are reached and the number of concurrences begins to increase quadratically. Also shown is the cumulative count of concurrence tuples generated for increasing class sizes, where we initially see a power-law(-esque) correlation, which subsequently begins to flatten as the larger concurrence classes become more sparse (although more massive).

For the predicate–subject pairs, the same roughly holds true, although we see fewer of the very largest concurrence classes: the largest concurrence class given by a predicate–subject pair was 79 thousand, versus 19 million for the largest predicate–object pair, respectively given by the pairs (kwa:map, macs:manual_rameau_lcsh) and (opiumfield:rating, ). Also, we observe some "noise" where, for milestone concurrence class sizes (esp. at 50, 100, 1000, 2000), we observe an unusual amount of classes. For example, there were 72 thousand concurrence classes of precisely size 1000 (versus 88 concurrence classes at size 996); the 1000 limit was due to a FOAF exporter from the hi5.com domain, which seemingly enforces that limit on the total "friends count" of users, translating into many users with precisely 1000 values for foaf:knows (cf. http://api.hi5.com/rest/profile/foaf/100614697). Also, for example, there were 55 thousand classes of size 2000 (versus 6 classes of size 1999); almost all of these were due to an exporter from the bio2rdf.org domain, which puts this limit on values for the b2r:linkedToFrom property (cf. http://bio2rdf.org/mesh:D000123Q000235). We also encountered unusually large numbers of classes approximating these milestones, such as 73 at 2001. Such phenomena explain the staggered "spikes" and "discontinuities" in Fig. 9b, which can be observed to correlate with such milestone values (in fact, similar but less noticeable spikes are also present in Fig. 9a).

These figures allow us to choose a threshold of concurrence-class size, given an upper bound on raw concurrence tuples to generate. For the purposes of evaluation, we choose to keep the number of materialised concurrence tuples at around 1 billion, which limits our maximum concurrence class size to 38 (from which we produce 1.024 billion tuples: 721 million through shared (p, o) pairs and 303 million through (p, s) pairs).
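The cut-off can be derived in a single pass over the class-size distribution. The following is a minimal sketch (not the authors' implementation), assuming a hypothetical precomputed histogram mapping each class size to the number of classes of that size, and counting s(s − 1)/2 distinct entity pairs per class of size s.

# Sketch: choose the largest concurrence-class size to keep so that the
# total number of raw concurrence tuples stays under a given budget.
# `histogram` maps class size s_i -> number of classes c_i (illustrative input).
def choose_cutoff(histogram, budget=1_000_000_000):
    total = 0
    cutoff = 0
    for size in sorted(histogram):
        tuples_for_size = histogram[size] * (size * (size - 1) // 2)
        if total + tuples_for_size > budget:
            break
        total += tuples_for_size
        cutoff = size
    return cutoff, total

# Toy usage with a made-up histogram:
example = {2: 500_000, 3: 200_000, 10: 10_000, 50: 100, 1_000_000: 1}
print(choose_cutoff(example))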

With respect to the statistics of predicates, for the predicate–subject pairs, each predicate had an average of 25,229 unique objects for 37,953 total triples, giving an average cardinality of 1.5. We give the five predicates observed to have the lowest adjusted average cardinality in Table 13. These predicates are judged by the algorithm to be the most selective for identifying their subject with a given object value (i.e., quasi-inverse-functional); note that the bottom two predicates will not generate any concurrence scores, since they are perfectly unique to a given object (i.e., the number of triples equals the number of objects, such that no two entities can share a value for this predicate). For the predicate–object pairs, there was an average of 11,572 subjects for 20,532 triples, giving an average inverse-cardinality of 2.64. We analogously give the five predicates observed to have the lowest adjusted average inverse-cardinality in Table 14. These predicates are judged to be the most selective for identifying their object with a given subject value (i.e., quasi-functional); however, four of these predicates will not generate any concurrences, since they are perfectly unique to a given subject (i.e., those where the number of triples equals the number of subjects).

Aggregation produced a final total of 636.9 million weighted concurrence pairs, with a mean concurrence weight of 0.0159. Of these pairs, 19.5 million involved a pair of identifiers from different PLDs (3.1%), whereas 617.4 million involved identifiers from the same PLD; however, the average concurrence value for an inter-PLD pair was 0.446, versus 0.002 for intra-PLD pairs: although fewer inter-PLD concurrences are found, they typically have higher concurrence values.

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading: for the purposes of illustration, we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

Table 15
Top five concurrent entities and the number of pairs they share.

     Entity label 1    Entity label 2    Concur.
  1  New York City     New York State    791
  2  London            England           894
  3  Tokyo             Japan             900
  4  Toronto           Ontario           418
  5  Philadelphia      Pennsylvania      217

Table 16
Breakdown of concurrences for top five ranked entities in SWSE, ordered by rank, with, respectively, entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

     Ranked entity       Con.   "Closest" entity    Val.
  1  Tim Berners-Lee     908    Lalana Kagal        0.83
  2  Dan Brickley        2552   Libby Miller        0.94
  3  update.status.net   11     socialnetwork.ro    0.45
  4  FOAF-a-matic        21     foaf.me             0.23
  5  Evan Prodromou      3367   Stav Prodromou      0.89

In Table 15 we give the labels of the top five most concurrent entities, including the number of pairs they share; the concurrence score for each of these pairs was >0.9999999. We note that they are all locations, where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters: properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isaac_Asimov, etc.

In Table 16 we give a description of the concurrent entities found for the top-five ranked entities; for brevity, again we show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links, primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.; (iii) update.status.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

7. Entity disambiguation and repair

We have already seen that, even by only exploiting the formal logical consequences of the data through reasoning, consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.
35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class, and breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

7.1. High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 leads to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency. In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position (as appropriate), and s′ and o′ denote the input identifiers prior to consolidation.

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms (http://models.okkam.org/ENS-core-vocabulary#country_of_residence), and one directory gives 9 owl:AllDisjointClasses axioms (http://ontologydesignpatterns.org/cp/owl/fsdas/), and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false


where we underline the terminological pattern (the first pattern above). To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length, a functional property, and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
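As a minimal sketch (not the authors' code), the relaxed comparison for two literal values of a functional datatype-property could look as follows, assuming literals are represented by their lexical form, optional language tag and optional datatype; the standard data-value equality of condition (i) is approximated here by structural equality.

# Sketch: relaxed comparison of two literal values of a functional
# datatype-property. Two (distinct) literals are only flagged as incompatible
# if their lower-cased lexical forms (ignoring language tag and datatype)
# also differ.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Literal:
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None

def incompatible(l1: Literal, l2: Literal) -> bool:
    if l1 == l2:                      # identical literals: no conflict
        return False
    # relaxed check: compare only the lower-cased lexical forms
    return l1.lexical.lower() != l2.lexical.lower()

# ex:Ted's foaf:gender values "male" and "Male"@en are not flagged:
print(incompatible(Literal("male"), Literal("Male", lang="en")))         # False
# ...whereas two genuinely different values for dpo:length are:
print(incompatible(Literal("3.546", datatype="xsd:double"),
                   Literal("2.97", datatype="xsd:double")))              # True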

With respect to these consistency checking rules, we consider the terminological data to be sound. We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

Further, note that owl:sameAs patterns, particularly in rule eq-diff1, are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández" and "Sergio Fernández Anzuola", who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export. The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns; a subset of these may require "positive" reasoning to be detected, e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived from there;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.



With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
made owl:propertyDisjointWith maker .

Assertional [fictional]:
ex:AZ″ maker ex:entcons′ .
dblp:Antoine_Zimmermann″ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
  (a) minimum distance in the non-transitive equivalence class;
  (b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject (loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics); c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U ∪ B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now let U(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej (intuitively, the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph). We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that ∄{a, b} ∈ C, a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights of U(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eqn+1 at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.
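The following is a compact sketch of this greedy, consistency-preserving clustering (an illustration under simplified inputs, not the authors' implementation): edge weights are (d, c) pairs compared lexicographically, and incompatible identifier pairs may never end up in the same partition; all identifiers and weights below are illustrative.

# Sketch: greedy repair of one equivalence class. `edges` maps an unordered
# identifier pair to its weight (d, c); `incompatible` is the set C of pairs
# that must be separated.
from itertools import combinations

def repair(identifiers, edges, incompatible):
    clusters = [{i} for i in identifiers]          # Eq_0: singleton sets
    def weight(a, b):                              # aggregated weight U(E_i, E_j)
        ds, cs = 0, 0.0
        for x in a:
            for y in b:
                d, c = edges.get(frozenset((x, y)), (0, 0.0))
                ds, cs = ds + d, cs + c
        return (ds, cs)                            # compared lexicographically
    def mergeable(a, b):
        return not any(frozenset((x, y)) in incompatible for x in a for y in b)
    changed = True
    while changed:                                 # iterate until fixpoint
        changed = False
        pairs = [(weight(a, b), ia, ib)
                 for (ia, a), (ib, b) in combinations(enumerate(clusters), 2)
                 if mergeable(a, b)]
        merged = set()
        for w, ia, ib in sorted(pairs, reverse=True):   # descending weight
            if w > (0, 0.0) and ia not in merged and ib not in merged:
                clusters[ia] |= clusters[ib]
                clusters[ib] = set()
                merged |= {ia, ib}
                changed = True
        clusters = [c for c in clusters if c]
    return clusters

ids = ["a", "b", "c"]
edges = {frozenset(("a", "b")): (2, 0.9), frozenset(("b", "c")): (1, 0.4)}
incompatible = {frozenset(("a", "c"))}
print(repair(ids, edges, incompatible))    # e.g. [{'a', 'b'}, {'c'}]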

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.
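A minimal sketch of such a merge-join (again an illustration, not the actual implementation): given one stream of sextuples sorted by canonical subject and one sorted by canonical object, the data for each canonical identifier are grouped into a single batch of outlinks and inlinks. The function and variable names are illustrative.

# Sketch: group two sorted sextuple streams into per-identifier batches.
# Both input streams are assumed sorted by their first element (the
# canonical subject or canonical object, respectively).
from itertools import groupby
import heapq

def batches(by_subject, by_object):
    tagged = heapq.merge(
        ((sx[0], "out", sx) for sx in by_subject),   # (s, p, o, c, s', o')
        ((sx[0], "in", sx) for sx in by_object),      # (o, p, s, c, o', s')
        key=lambda t: t[0])
    for key, group in groupby(tagged, key=lambda t: t[0]):
        batch = {"id": key, "out": [], "in": []}
        for _, direction, sx in group:
            batch[direction].append(sx)
        yield batch

subj_sorted = [("canon:A", "foaf:knows", "canon:B", "ctx:1", "ex:a1", "ex:b1")]
obj_sorted  = [("canon:A", "foaf:maker", "canon:C", "ctx:2", "ex:a2", "ex:c1")]
for b in batches(subj_sorted, obj_sorted):
    print(b["id"], len(b["out"]), "outlinks,", len(b["in"]), "inlinks")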

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently these segments fit in memory, where for the largest equivalence classes we note that inlinks/outlinks are commonly duplicated (if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time). We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules; usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and reinfer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory, and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus we are left with two steps

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

Table 17
Breakdown of timing of distributed disambiguation and repair.

                                            1              2              4              8              Full-8
                                            min    %       min    %       min    %       min    %       min     %
Master
  Total execution time                      114.7  100.0   61.3   100.0   36.7   100.0   23.8   100.0   141.1   100.0
  Executing: Miscellaneous                  0.1            0.1            0.1            0.1
  Idle (waiting for slaves)                 114.7  100.0   61.2   100.0   36.6   99.9    23.8   99.9    141.1   100.0
Slave
  Avg. Executing (total exc. idle)          114.7  100.0   59.0   96.3    33.6   91.5    22.3   93.7    130.9   92.8
    Identify inconsistencies and repairs    74.7   65.1    38.5   62.8    23.2   63.4    17.2   72.2    72.4    51.3
    Repair corpus                           40.0   34.8    20.5   33.5    10.3   28.1    5.1    21.5    58.5    41.5
  Avg. Idle                                 0.0    0.0     2.3    3.7     3.1    8.5     1.5    6.3     10.2    7.2
    Waiting for peers                       0.0    0.0     2.2    3.7     3.1    8.4     1.5    6.2     10.1    7.2
    Waiting for master                                                                                   0.1     0.1

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17 we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and where only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage (and the associated load-balancing issues) would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines, as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.
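For reference, these factors follow directly from the total execution times reported in Table 17:

\[
\frac{61.3}{114.7} \approx 0.534, \qquad \frac{36.7}{61.3} \approx 0.599, \qquad \frac{23.8}{36.7} \approx 0.649.
\]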

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic: from the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency.


Table 20
Results of manual repair inspection.

     Manual       Auto         Count   %
     SAME         ⁄            223     44.3
     DIFFERENT    ⁄            261     51.9
     UNCLEAR      –            19      3.8
     ⁄            SAME         361     74.6
     ⁄            DIFFERENT    123     25.4
     SAME         SAME         205     91.9
     SAME         DIFFERENT    18      8.1
     DIFFERENT    DIFFERENT    105     40.2
     DIFFERENT    SAME         156     59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

     Functional property                 Detections
  1  foaf:gender                         60
  2  foaf:age                            32
  3  dbo:height, dbo:length              4
  4  dbo:wheelbase, dbo:width            3
  5  atomowl:body                        1
  6  loc:address, loc:name, loc:phone    1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

     Disjoint class 1      Disjoint class 2    Detections
  1  foaf:Document         foaf:Person         163
  2  foaf:Document         foaf:Agent          153
  3  foaf:Organization     foaf:Person         17
  4  foaf:Person           foaf:Project        3
  5  foaf:Organization     foaf:Project        1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log).

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log).

44 We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.


Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional-properties given non-distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent, and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME) and option DIFFERENT 249 times. We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and an F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and an F1-measure of 0.546.
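These figures can be traced back to the counts in Table 20, taking the manual SAME/DIFFERENT annotations as ground truth:

\[
P_{\mathrm{same}} = \frac{205}{361} \approx 0.568, \quad R_{\mathrm{same}} = \frac{205}{223} \approx 0.919, \quad F_1 = \frac{2PR}{P+R} \approx 0.702;
\]
\[
P_{\mathrm{diff}} = \frac{105}{123} \approx 0.854, \quad R_{\mathrm{diff}} = \frac{105}{261} \approx 0.402, \quad F_1 \approx 0.546.
\]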

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values, which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent; these can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases, in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner. Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log(n)); external sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

ndash our current baseline consolidation approach relies on an in-memory owlsameAs index however we demonstrate an on-disk variant in the extended consolidation approach

ndash the extended consolidation currently loads terminological datathat is required by all machines into memory if necessary weclaim that an on-disk terminological index would offer goodperformance given the distribution of class and property mem-berships where we believe that a high cache-hit rate would beenjoyed

ndash for the entity concurrence analysis the predicate level statisticsrequired by all machines is small in volumemdashfor the momentwe do not see this as a serious factor in scaling-up

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

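As a minimal illustrative sketch (and not the actual implementation evaluated in this paper), an in-memory owl:sameAs index can be realised as a union–find structure over identifiers, with the canonical identifier of each equivalence class chosen by lexical order; the class and method names below are purely illustrative.

class SameAsIndex:
    """Minimal union-find sketch of an in-memory owl:sameAs index.

    Each identifier is mapped to a canonical representative; following
    the convention of choosing a canon per equivalence class, we pick
    the lexically lowest identifier in each class.
    """

    def __init__(self):
        self.parent = {}

    def find(self, term):
        # Path-compressing find; unseen terms are their own canon.
        root = self.parent.setdefault(term, term)
        if root != term:
            root = self.parent[term] = self.find(root)
        return root

    def add_same_as(self, a, b):
        # Merge the classes of a and b, keeping the lexically lowest root.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        canon, other = (ra, rb) if ra < rb else (rb, ra)
        self.parent[other] = canon

    def canonicalise(self, term):
        return self.find(term)


if __name__ == "__main__":
    idx = SameAsIndex()
    idx.add_same_as("ex:TimBL", "dbpedia:Tim_Berners-Lee")
    idx.add_same_as("dbpedia:Tim_Berners-Lee", "timblfoaf:i")
    # All three identifiers now share one canonical term.
    print(idx.canonicalise("timblfoaf:i"))
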
With respect to efficiency

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck—for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case (a small sketch of this placement scheme follows the list);

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

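To illustrate the placement scheme mentioned above, the following is a small sketch under our own simplifying assumptions (string terms, and CRC32 as the hash function; the choice of hash is not prescribed here):

import zlib

def machine_for(term, num_machines):
    # Modulo-hash placement of an RDF term over the available machines.
    return zlib.crc32(term.encode("utf-8")) % num_machines

def placements(quad, num_machines):
    # Return the machine(s) a quad (s, p, o, c) is sent to: quads are
    # hashed on subject and on object, except that objects of rdf:type
    # triples are not hashed on (avoiding skew from heavily-used classes).
    s, p, o, c = quad
    machines = {machine_for(s, num_machines)}
    if p != "rdf:type":
        machines.add(machine_for(o, num_machines))
    return machines

if __name__ == "__main__":
    quad = ("dbpedia:Galway", "rdf:type", "dbo:Town", "http://dbpedia.org/data/Galway.xml")
    print(placements(quad, 8))  # hashed on the subject only
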
With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks—particularly the scale and diversity of the datasets—we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what—the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such consideration, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third party data about an entity is somewhat fallacious without also considering the third party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9 Related work

91 Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas—similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions […] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty; [thus] it is often necessary to hinge on the semantics of the data".

Some other works—more related to Information Retrieval and Natural Language Processing—focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition, where, for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

92 Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets—i.e., consistently named datasets—across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

47 Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology—particularly functional properties, inverse-functional properties and disjointness constraints—which are then used to infer consistent coreference matches; (ii) N2R is an inductive numeric approach, which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMoM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.

Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

93 Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL—we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

94 Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges—esp. scalability and noise—of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system—called ObjectCoRef—for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

95 Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable distributed rule-based materialisation.

96 Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties—i.e., those for which a typical value is shared by few instances—are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system—called RDFSim—to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
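
To give a flavour of this general idea, the following is a generic MinHash-based sketch of LSH candidate grouping (illustrative only; it is not RDFSim's actual index, hash functions, or parameters):

import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=32):
    # Generic MinHash signature of a token set.
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def candidate_groups(docs, num_hashes=32, bands=8):
    # Documents whose signatures collide in at least one band become
    # candidates for more expensive pair-wise comparison.
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    return [group for group in buckets.values() if len(group) > 1]

if __name__ == "__main__":
    docs = {
        "ex:TimBL_1": {"tim", "berners", "lee", "w3c", "director"},
        "ex:TimBL_2": {"tim", "berners", "lee", "w3c"},
        "ex:DanB": {"dan", "brickley", "foaf", "vocabulary"},
    }
    print(candidate_groups(docs))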

97 Naminglinking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/
56 Again, our inspection results are available at http://swse.deri.org/entity

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes—e.g., to notify another agent or correct broken links—and can also be used to indirectly link the dynamic target.

98 Analyses of owlsameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs—particularly substitution—does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10 Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion.

We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper—particularly as applied to large-scale Linked Data corpora—is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications: we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec/
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://www.data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni Monica De Martino Semantic similarity of ontologyinstances tailored on the application context in Proc of OTM 2006 Part IVolume 4275 of LNCS Springer 2006 pp 1020ndash1038

[2] Riccardo Albertoni Monica De Martino Asymmetric and context-dependentsemantic similarity among ontology instances J Data Sem 4900 (10) (2008)1ndash30

[3] David Beckett Tim Berners-Lee Turtle ndash Terse RDF Triple Language W3C TeamSubmission (2008) httpwwww3orgTeamSubmissionturtle

[4] Tim Berners-Lee Linked Data Design issues for the World Wide Web WorldWide Web Consortium 2006 Available from lthttpwwww3orgDesignIssuesLinkedDatahtmlgt

[5] Philip A Bernstein Sergey Melnik Peter Mork Interactive schema translationwith instance-level mappings in Proc of VLDB 2005 ACM Press 2005 pp1283ndash1286

[6] Barry Bishop Atanas Kiryakov Damyan Ognyanov Ivan Peikov ZdravkoTashev Ruslan Velkov Factforge a fast track to the web of data Semantic Web2 (2) (2011) 157ndash166

[7] Christian Bizer Richard Cyganiak Tom Heath How to Publish Linked Data onthe Web linkeddataorg Tutorial (2008) Available from lthttplinkeddataorgdocshow-to-publishgt

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley RV Guha RDF Vocabulary Description Language 10 RDFSchema W3C Recommendation (2004) httpwwww3orgTRrdf-schema

[12] Silvana Castano Alfio Ferrara Stefano Montanelli Davide Lorusso InstanceMatching for Ontology Population in Salvatore Gaglio Ignazio InfantinoDomenico Saccaacute (Eds) SEBD 2008 pp 121ndash132

[13] Zhaoqi Chen Dmitri V Kalashnikov Sharad Mehrotra Exploiting relationshipsfor object consolidation in IQIS rsquo05 Proceedings of the Second internationalworkshop on Information quality in information systems ACM Press NewYork NY USA 2005 pp 47ndash58

[14] Gong Cheng Weiyi Ge Honghan Wu Yuzhong Qu Searching Semantic Webobjects based on class hierarchies in Proceedings of Linked Data on the WebWorkshop (2008)

[15] Gong Cheng Yuzhong Qu Searching linked objects with falcons approachimplementation and evaluation Int J Sem Web Inf Syst 5 (3) (2009) 49ndash70

[16] Martine De Cock Etienne E Kerre On (un)suitable fuzzy relations to modelapproximate equality Fuzzy Sets Syst 133 (2) (2003) 137ndash153

[17] Philippe Cudreacute-Mauroux Parisa Haghani Michael Jost Karl Aberer Hermannde Meer idMesh graph-based disambiguation of linked data WWW (2009)591ndash600

[18] Li Ding Joshua Shinavier Zhenning Shangguan Deborah L McGuinnessSameAs networks and beyond analyzing deployment status and implicationsof owl sameAs in linked data in Proceedings of the Ninth InternationalSemantic Web Conference on the Semantic Web ndash Volume Part I ISWCrsquo10Springer-Verlag Berlin Heidelberg 2010 pp 145ndash160

[19] AK Elmagarmid PG Ipeirotis VS Verykios Duplicate record detection asurvey IEEE Trans Knowledge Data Eng 19 (1) (2007) 1ndash16

[20] Jeacuterocircme Euzenat Alfio Ferrara Christian Meilicke Andriy Nikolov Juan PaneFranccedilois Scharffe Pavel Shvaiko Heiner Stuckenschmidt Ondrej Svaacuteb-Zamazal Vojtech Svaacutetek Caacutessia Trojahn Results of the ontology alignmentevaluation initiative 2010 OM (2010)

[21] Wenfei Fan Xibei Jia Jianzhong Li Shuai Ma Reasoning about record matchingrules PVLDB 2 (1) (2009) 407ndash418

[22] Hugh Glaser Afraz Jaffri Ian Millard Managing co-reference on the SemanticWeb In Proc of LDOW 2009 (2009)

[23] Hugh Glaser Ian Millard Afraz Jaffri RKBExplorercom A knowledge driveninfrastructure for linked data providers in ESWC Demo Lecture Notes inComputer Science Springer 2008 pp 797ndash801

[24] Bernardo Cuenca Grau Boris Motik Zhe Wu Achille Fokoue Carsten Lutz(Eds) OWL 2 Web Ontology Language Profiles W3C Working Draft (2008)httpwwww3orgTRowl2-profiles

[25] Laura M Haas Martin Hentschel Donald Kossmann Reneacutee J Miller Schemaand data a holistic approach to mapping resolution and fusion in informationintegration ER (2009) 27ndash40

[26] Harry Halpin Patrick J Hayes James P McCusker Deborah L McGuinnessHenry S Thompson When owlsameAs Isnrsquot the Same An Analysis of Identityin Linked Data in International Semantic Web Conference (1) (2010) 305ndash320

[27] Harry Halpin Ivan Herman Pat Hayes When owlsameAs isnrsquot the Same AnAnalysis of Identity Links on the Semantic Web in Linked Data on the WebWWW2010 Workshop (LDOW2010) (2010)

[28] Patrick Hayes RDF Semantics W3C Recommendation (2004) httpwwww3orgTRrdf-mt

[29] Aidan Hogan Exploiting RDFS and OWL for Integrating Heterogeneous Large-Scale Linked Data Corpora PhD Thesis Digital Enterprise Research InstituteNational University of Ireland Galway 2011 Available from lthttpaidanhogancomdocsthesisgt

[30] Aidan Hogan Andreas Harth Stefan Decker Performing Object Consolidationon the Semantic Web Data Graph in First I3 Workshop Identity IdentifiersIdentification Workshop 2007

[31] Aidan Hogan Andreas Harth Axel Polleres Scalable Authoritative OWLReasoning for the Web Int J Semantic Web Inf Syst 5 (2) (2009) 49ndash90

[32] Aidan Hogan Andreas Harth Juumlrgen Umbrich Sheila Kinsella Axel PolleresStefan Decker Searching and browsing linked data with SWSE the SemanticWeb search engine J Web Sem 9 (4) (2011) 365ndash401

[33] Aidan Hogan Jeff Z Pan Axel Polleres Stefan Decker SAOR template ruleoptimisations for distributed reasoning over 1 billion linked data triples inInternational Semantic Web Conference 2010

[34] Aidan Hogan Axel Polleres Juumlrgen Umbrich Antoine Zimmermann Someentities are more equal than others statistical methods to consolidate LinkedData in Fourth International Workshop on New Forms of Reasoning for theSemantic Web Scalable and Dynamic (NeFoRS2010) 2010

[35] Wei Hu Jianfeng Chen Yuzhong Qu A self-training approach for resolvingobject coreference on the Semantic Web WWW (2011) 87ndash96

[36] Wei Hu Yuzhong Qu Xingzhi Sun Bootstrapping object coreferencing on theSemantic Web J Comput Sci Technol 26 (4) (2011) 663ndash675

[37] Ekaterini Ioannou Odysseas Papapetrou Dimitrios Skoutas Wolfgang NejdlEfficient semantic-aware detection of near duplicate resources in ESWC (2)2010 pp 136ndash150

[38] Anja Jentzsch Jun Zhao Oktie Hassanzadeh Kei-Hoi Cheung MatthiasSamwald Bo Andersson Linking Open Drug Data in InternationalConference on Semantic Systems (I-SEMANTICSrsquo09) 2009

[39] Frank Klawonn Should fuzzy equality and similarity satisfytransitivityComments on the paper by M De Cock and E Kerre Fuzzy SetsSyst 133 (2) (2003) 175ndash180

[40] Vladimir Kolovski Zhe Wu George Eadon Optimizing enterprise-scale OWL 2RL reasoning in a relational database system in International Semantic WebConference (2010)

[41] Juanzi Li Jie Tang Yi Li Qiong Luo Rimom A dynamic multistrategy ontologyalignment framework IEEE Trans Knowl Data Eng 21 (8) (2009) 1218ndash1232

[42] Deborah L McGuinness Frank van Harmelen OWL Web Ontology LanguageOverview W3C Recommendation (2004) Available from lthttpwwww3orgTRowl-featuresgt

[43] Georgios Meditskos Nick Bassiliades DLEJena A practical forward-chainingOWL 2 RL reasoner combining Jena and Pellet J Web Sem 8 (1) (2010) 89ndash94

[44] Martin Michalowski Snehal Thakkar Craig A Knoblock Exploiting secondarysources for automatic object consolidation in Proceeding of 2003 KDDWorkshop on Data Cleaning Record Linkage and Object Consolidation (2003)

[45] Alistair Miles Thomas Baker Ralph Swick Best Practice Recipes for PublishingRDF Vocabularies March 2006 Available from lthttpwwww3orgTR2006WD-swbp-vocab-pub-20060314lgt Superceded by Berrueta and Phippslthttpwwww3orgTRswbp-vocab-pubgt

[46] Fergal Monaghan David OrsquoSullivan Leveraging ontologies context and socialnetworks to automate photo annotation in SAMT (2007) 252ndash255

[47] HB Newcombe JM Kennedy SJ Axford AP James Automatic Linkage ofVital Records Computers can be used to extract lsquolsquofollow-uprsquorsquo statistics offamilies from files of routine records Science 130 (1959) 954ndash959

[48] Andriy Nikolov Victoria S Uren Enrico Motta Anne N De Roeck Integration ofSemantically Annotated Data by the KnoFuss Architecture in EKAW 2008 pp265ndash274

[49] Jan Noessner Mathias Niepert Christian Meilicke Heiner StuckenschmidtLeveraging terminological structure for object reconciliation in ESWC (2)2010 pp 334ndash348

[50] Eyal Oren Renaud Delbru Michele Catasta Richard Cyganiak HolgerStenzhorn Giovanni Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J Metadata Semant Ontol 3 (1) (2008) 37ndash52

[51] Eyal Oren Spyros Kotoulas George Anadiotis Ronny Siebes Annette ten TeijeFrank van Harmelen Marvin distributed reasoning over large-scale SemanticWeb data J Web Sem 7 (4) (2009) 305ndash316

[52] Niko Popitsch Bernhard Haslhofer DSNotify handling broken links in the webof data in WWW (2010) 761ndash770

[53] Eric Prudrsquohommeaux Andy Seaborne SPARQL Query Language for RDF W3CRecommendation (2008) httpwwww3orgTRrdf-sparql-query

[54] Yves Raimond Christopher Sutton Mark B Sandler Interlinking Music-Related Data on the Web IEEE MultiMedia 16 (2) (2009) 52ndash63

[55] Raymond Reiter A Theory of Diagnosis from First Principles Artif Intell 32 (1)(1987) 57ndash95

[56] Fatiha Saıs Nathalie Pernelle Marie-Christine Rousset Combining a logicaland a numerical method for data reconciliation J Data Sem 12 (2009)66ndash94

[57] Manuel Salvadores Gianluca Correndo Bene Rodriguez-Castro NicholasGibbins John Darlington Nigel R Shadbolt LinksB2N automatic dataintegration for the Semantic Web in OTM Conferences (2) 2009 pp 1121ndash1138

[58] F Scharffe Y Liu C Zhou RDF-AI an architecture for RDF datasets matchingfusion and interlink in IJCAI 2009 Workshop on Identity Reference andKnowledge Representation (IR-KR) (2009)

[59] Lian Shi Diego Berrueta Sergio Fernaacutendez Luis Polo Silvino FernaacutendezSmushing RDF instances are Alice and Bob the same open source developerin PICKME Workshop (2008)

[60] Sameer Singh Michael L Wick Andrew McCallum Distantly labeling data forlarge scale cross-document coreference CoRR (2010) abs10054298

[61] Jennifer Sleeman Tim Finin Learning co-reference relations for FOAFinstances in Poster and Demo Session at ISWC (2010)

[62] Dezhao Song Jeff Heflin Automatically generating data linkages using adomain-independent candidate selection approach in International SemanticWeb Conference (2011)

[63] Mechthild Stoer Frank Wagner A simple min-cut algorithm J ACM 44 (4)(1997) 585ndash591

[64] Michael Stonebraker The Case for Shared Nothing IEEE Database Eng Bull 9(1) (1986) 4ndash9

[65] Heiner Stuckenschmidt A semantic similarity measure for ontology-basedinformation in FQAS rsquo09 Proceedings of the Eighth International Conferenceon Flexible Query Answering Systems Springer-Verlag Berlin Heidelberg2009 pp 406ndash417

[66] Robert Endre Tarjan A class of algorithms which require nonlinear time tomaintain disjoint sets J Comput Syst Sci 18 (2) (1979) 110ndash127

[67] Herman J ter Horst Completeness decidability and complexity of entailmentfor RDF Schema and a semantic extension involving the OWL vocabulary JWeb Sem 3 (2005) 79ndash115

[68] Andreas Thor Erhard Rahm Moma ndash a mapping-based object matchingsystem in CIDR (2007) 247ndash258

[69] Giovanni Tummarello Richard Cyganiak Michele Catasta Szymon DanielczykStefan Decker Sigma live views on the Web of data in Semantic WebChallenge (ISWC2009) (2009)

[70] Jacopo Urbani Spyros Kotoulas Jason Maassen Frank van Harmelen Henri EBal OWL reasoning with WebPIE calculating the closure of 100 billion triplesin ESWC (1) 2010 pp 213ndash227

[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649

[72] Julius Volz Christian Bizer Martin Gaedke Georgi Kobilarov Discovering andmaintaining links on the Web of data in International Semantic WebConference (2009) 650ndash665

[73] Denny Vrandeciacutec Markus Kroumltzsch Sebastian Rudolph Uta Loumlsch Leveragingnon-lexical knowledge for the linked open data Web Review of April Foolrsquosday Transactions (RAFT) 5 (2010) 18ndash27

[74] Jesse Weaver James A Hendler Parallel materialization of the finite RDFSclosure for hundreds of millions of triples in International Semantic WebConference (ISWC2009) (2009) 682ndash697

[75] Lotfi A Zadeh George J Klir Bo Yuan Fuzzy Sets Fuzzy Logic Fuzzy SystemsWorld Scientific Press 1996

  • Scalable and distributed methods for entity matching consolidation and disambiguation over linked data corpora
    • 1 Introduction
    • 2 Preliminaries
      • 21 RDF
      • 22 RDFS OWL and owlsameAs
      • 23 Inference rules
      • 24 T-split rules
      • 25 Authoritative T-split rules
      • 26 OWL 2 RLRDF rules
      • 27 Distribution architecture
      • 28 Experimental setup
        • 3 Experimental corpus
        • 4 Base-line consolidation
          • 41 High-level approach
          • 42 Distributed approach
          • 43 Performance evaluation
          • 44 Results evaluation
            • 5 Extended reasoning consolidation
              • 51 High-level approach
              • 52 Distributed approach
              • 53 Performance evaluation
              • 54 Results evaluation
                • 6 Statistical concurrence analysis
                  • 61 High-level approach
                    • 611 Quantifying concurrence
                    • 612 Implementing entity-concurrence analysis
                      • 62 Distributed implementation
                      • 63 Performance evaluation
                      • 64 Results evaluation
                        • 7 Entity disambiguation and repair
                          • 71 High-level approach
                          • 72 Implementing disambiguation
                          • 73 Distributed implementation
                          • 74 Performance evaluation
                          • 75 Results evaluation
                            • 8 Critical discussion
                            • 9 Related work
                              • 91 Related fields
                              • 92 Instance matching
                              • 93 Domain-specific consolidation
                              • 94 Large-scaleWeb-data consolidation
                              • 95 Large-scaledistributed consolidation
                              • 96 Candidate reduction
                              • 97 Naminglinking resources on the Web
                              • 98 Analyses of owlsameAs
                                • 10 Conclusion
                                • Acknowledgements
                                • Appendix A Prefixes
                                • References

Table 15
Top five concurrent entities and the number of pairs they share.

  Entity label 1     Entity label 2     Concur.
1 New York City      New York State     791
2 London             England            894
3 Tokyo              Japan              900
4 Toronto            Ontario            418
5 Philadelphia       Pennsylvania       217

Table 16
Breakdown of concurrences for top five ranked entities in SWSE, ordered by rank, with, respectively, entity label, number of concurrent entities found, the label of the concurrent entity with the largest degree, and finally the degree value.

  Ranked entity      Con.   "Closest" entity    Val.
1 Tim Berners-Lee    908    Lalana Kagal        0.83
2 Dan Brickley       2552   Libby Miller        0.94
3 updatestatus.net   11     socialnetwork.ro    0.45
4 FOAF-a-matic       21     foaf.me             0.23
5 Evan Prodromou     3367   Stav Prodromou      0.89

fewer intra-PLD concurrences are found, they typically have higher concurrences.32

In Table 15, we give the labels of the top five most concurrent entities, including the number of pairs they share—the concurrence score for each of these pairs was > 0.9999999. We note that they are all locations, where, particularly on WIKIPEDIA (and thus filtering through to DBpedia), properties with location values are typically duplicated (e.g., dbp:deathPlace, dbp:birthPlace, dbp:headquarters—properties that are quasi-functional); for example, New York City and New York State are both the dbp:deathPlace of dbpedia:Isacc_Asimov, etc.

In Table 16, we give a description of the concurrent entities found for the top-five ranked entities—for brevity, again we show entity labels. In particular, we note that a large amount of concurrent entities are identified for the highly-ranked persons. With respect to the strongest concurrences: (i) Tim and his former student Lalana share twelve primarily academic links, coauthoring six papers; (ii) Dan and Libby, co-founders of the FOAF project, share 87 links: primarily 73 foaf:knows relations to and from the same people, as well as a co-authored paper, occupying the same professional positions, etc.;33 (iii) updatestatus.net and socialnetwork.ro share a single foaf:accountServiceHomepage link from a common user; (iv) similarly, the FOAF-a-matic and foaf.me services share a single mvcb:generatorAgent inlink; (v) finally, Evan and Stav share 69 foaf:knows inlinks and outlinks exported from the identi.ca service.

34 We instead refer the interested reader to [9] for some previous works on the topic of general inconsistency repair for Linked Data.

7 Entity disambiguation and repair

We have already seen that—even by only exploiting the formal logical consequences of the data through reasoning—consolidation may already be imprecise due to various forms of noise inherent in Linked Data. In this section, we look at trying to detect erroneous coreferences as produced by the extended consolidation approach introduced in Section 5. Ideally, we would like to automatically detect, revise and repair such cases to improve the effective precision of the consolidation process. As discussed in Section 2.6, OWL 2 RL/RDF contains rules for automatically detecting inconsistencies in RDF data, representing formal contradictions according to OWL semantics. Where such contradictions are created due to consolidation, we believe this to be a good indicator of erroneous consolidation.

32 Note that we apply this analysis over the consolidated data, and thus this is an approximative reading for the purposes of illustration: we extract the PLDs from canonical identifiers, which are chosen based on arbitrary lexical ordering.

33 Notably, Leigh Dodds (creator of the FOAF-a-matic service) is linked by the property quaffing:drankBeerWith to both.

35 We use syntactic shortcuts in our file to denote when s = s′ and/or o = o′. Maintaining the additional rewrite information during the consolidation process is trivial, where the output of consolidating subjects gives quintuples (s, p, o, c, s′), which are then sorted and consolidated by o to produce the given sextuples.

36 http://models.okkam.org/ENS-core-vocabulary#country_of_residence
37 http://ontologydesignpatterns.org/cp/owl/fsdas/


Once erroneous equivalences have been detected, we would like to subsequently diagnose and repair the coreferences involved. One option would be to completely disband the entire equivalence class; however, there may often be only one problematic identifier that, e.g., causes inconsistency in a large equivalence class—breaking up all equivalences would be a coarse solution. Instead, herein we propose a more fine-grained method for repairing equivalence classes that incrementally rebuilds coreferences in the set in a manner that preserves consistency, and that is based on the original evidences for the equivalences found, as well as concurrence scores (discussed in the previous section) to indicate how similar the original entities are. Once the erroneous equivalence classes have been repaired, we can then revise the consolidated corpus to reflect the changes.

71 High-level approach

The high-level approach is to see if the consolidation of any entities conducted in Section 5 led to any novel inconsistencies, and to subsequently recant the equivalences involved; thus, it is important to note that our aim is not to repair inconsistencies in the data, but instead to repair incorrect consolidation symptomised by inconsistency.34 In this section, we (i) describe what forms of inconsistency we detect and how we detect them; (ii) characterise how inconsistencies can be caused by consolidation, using examples from our corpus where possible; (iii) discuss the repair of equivalence classes that have been determined to cause inconsistency.

First, in order to track which data are consolidated and which not in the previous consolidation step, we output sextuples of the form:

(s, p, o, c, s′, o′)

where s, p, o, c denote the consolidated quadruple containing canonical identifiers in the subject/object position as appropriate, and s′ and o′ denote the input identifiers prior to consolidation.35
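
As a minimal sketch (illustrative only; the helper below is not part of the described system), such sextuples make it straightforward to recant the consolidation of a single canonical identifier by restoring the tracked input identifiers:

def revert_identifier(sextuples, bad_canon):
    # Undo consolidation for one canonical term by restoring the
    # original subject/object identifiers recorded in the sextuples.
    repaired = []
    for s, p, o, c, s0, o0 in sextuples:
        if s == bad_canon:
            s = s0
        if o == bad_canon:
            o = o0
        repaired.append((s, p, o, c, s0, o0))
    return repaired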

To detect inconsistencies in the consolidated corpus, we use the OWL 2 RL/RDF rules with the false consequent [24], as listed in Table 4. A quick check of our corpus revealed that one document provides eight owl:AsymmetricProperty and 10 owl:IrreflexiveProperty axioms,36 and one directory gives 9 owl:AllDisjointClasses axioms,37 and where we found no other OWL 2 axioms relevant to the rules in Table 4.

We also consider an additional rule, whose semantics are indirectly axiomatised by the OWL 2 RL/RDF rules (through prp-fp, dt-diff and eq-diff1), but which we must support directly since we do not consider consolidation of literals:

p a owl:FunctionalProperty .
x p l1 , l2 .
l1 owl:differentFrom l2 .
⇒ false

where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:

dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:

dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:

dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length—a functional-property—and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics); and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender, and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally defined for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but not giving recommendations for datatype or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:

foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:

ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html
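
The relaxed comparison just described can be summarised by a small sketch (illustrative only, assuming Turtle-style literal strings; this is not the system's actual code):

def lexical_form(literal):
    # Lower-cased lexical form of a literal, ignoring any language tag
    # or datatype suffix; bare values (e.g., 25) are handled as strings.
    value = str(literal)
    if value.startswith('"'):
        value = value[1:value.rindex('"')]
    return value.lower()

def flag_inconsistency(l1, l2):
    # Flag an inconsistency for a functional datatype property iff the
    # two values differ and their lexical forms also differ.
    return l1 != l2 and lexical_form(l1) != lexical_form(l2)

if __name__ == "__main__":
    print(flag_inconsistency('"male"', '"Male"@en'))                        # False
    print(flag_inconsistency('"25-05"^^xsd:gMonthDay', '"25-05"'))          # False
    print(flag_inconsistency('"3.546"^^xsd:double', '"2.97"^^xsd:double'))  # True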

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation; we are thus only interested in join variables that appear at least once in a consolidatable position (thus, we do not support dt-not-type), and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw unconsolidated corpus.

Further note that owl:sameAs patterns—particularly in rule eq-diff1—are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:

wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola",40 who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:

foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:

identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:

identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:

semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived therefrom;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.




With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]
made owl:propertyDisjointWith maker .

Assertional [fictional]
ex:AZ maker ex:entcons .
dblp:Antoine_Zimmermann made dblp:HoganZUPD15 .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:


(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω) such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics. Here, c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.
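For clarity, the lexicographical order on these weight pairs can be written out as follows (our restatement in LaTeX notation; the original text only states it in prose):

  (d_1, c_1) > (d_2, c_2) \iff d_1 > d_2 \;\lor\; \left( d_1 = d_2 \land c_1 > c_2 \right)

That is, the number of independent evidences d is compared first, and the concurrence score c is consulted only to break ties.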

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, this is the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that there exists no {a, b} ∈ C with a ∈ Ei, b ∈ Ej (i.e., consistently mergeable subsets), identify the weights Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eqn+1 at the iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43

43. We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].
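The following is a compact, illustrative sketch of this greedy clustering (not the original implementation): it assumes the weight function simply returns a comparable value, e.g., the (evidence, concurrence) pair discussed above, and all names are ours.

  from itertools import combinations

  def repair(identifiers, incompatible, weight):
      # identifiers: identifiers in the (inconsistent) equivalence class
      # incompatible: set of frozensets {a, b} that must not end up together
      # weight: function(cluster_i, cluster_j) -> comparable weight, e.g. (d, c)
      clusters = [frozenset([i]) for i in identifiers]    # singleton seed classes
      changed = True
      while changed:                                      # iterate until fixpoint
          changed = False
          # candidate merges: pairs whose union contains no incompatible pair
          candidates = [
              (weight(ci, cj), ci, cj)
              for ci, cj in combinations(clusters, 2)
              if not any(p <= (ci | cj) for p in incompatible)
          ]
          merged = set()
          # merge in descending weight order, each cluster at most once per round
          for w, ci, cj in sorted(candidates, key=lambda t: t[0], reverse=True):
              if ci in merged or cj in merged:
                  continue
              clusters.remove(ci)
              clusters.remove(cj)
              clusters.append(ci | cj)
              merged.update((ci, cj))
              changed = True
      return clusters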

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data and batch consolidated segments of data.
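As a rough sketch of how such a merge-join over the two sorted streams can batch all statements for one canonical identifier, the following illustrative code assumes the sextuple layouts just described; the tuple layout and function names are ours, not the paper's implementation.

  from itertools import groupby

  def merge_join(by_subject, by_object):
      # by_subject: sextuples (s, p, o, c, s_orig, o_orig), sorted on canonical s
      # by_object:  sextuples (o, p, s, c, o_orig, s_orig), sorted on canonical o
      # yields one batch (canonical id, outlinks, inlinks) per consolidated entity
      outs = groupby(by_subject, key=lambda t: t[0])
      ins = groupby(by_object, key=lambda t: t[0])
      o_key, o_grp = next(outs, (None, iter(())))
      i_key, i_grp = next(ins, (None, iter(())))
      while o_key is not None or i_key is not None:
          k = min(x for x in (o_key, i_key) if x is not None)
          outlinks = list(o_grp) if o_key == k else []
          inlinks = list(i_grp) if i_key == k else []
          if o_key == k:
              o_key, o_grp = next(outs, (None, iter(())))
          if i_key == k:
              i_key, i_grp = next(ins, (None, iter(())))
          yield k, outlinks, inlinks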

On a micro level, we buffer each individual consolidated segment into an in-memory index: currently these segments fit in memory, where for the largest equivalence classes we note that inlinks/outlinks are commonly duplicated; if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply the scan-reasoning and inconsistency detection over each batch and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and re-infer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hash-set corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hash-sets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
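For illustration, a minimal sketch of the node-merging step under these hash-set edge labels might look as follows (our own simplification; a, b and the dictionary layout are illustrative):

  from collections import defaultdict

  def merge_nodes(edges, a, b):
      # edges: dict mapping frozenset({u, v}) -> set of input-triple hashes
      # merge node b into node a; overlapping edges have their hash-sets unioned,
      # so the primary weight of any edge remains len(edges[e])
      merged = defaultdict(set)
      for e, hashes in edges.items():
          u, v = tuple(e)
          u = a if u == b else u
          v = a if v == b else v
          if u != v:                    # the edge between a and b collapses away
              merged[frozenset((u, v))] |= hashes
      return dict(merged)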

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair (min / %), for 1, 2, 4 and 8 machines over the sampled corpus, and 8 machines over the full corpus (Full-8).

Category                                  1               2              4              8              Full-8
Master
  Total execution time                    114.7 / 100.0   61.3 / 100.0   36.7 / 100.0   23.8 / 100.0   141.1 / 100.0
  Executing (miscellaneous)               0.1 / 0.1       0.1 / 0.1      0.1 / 0.1      0.1 / 0.1      0.1 / 0.1
  Idle (waiting for slaves)               114.7 / 100.0   61.2 / 100.0   36.6 / 99.9    23.8 / 99.9    141.1 / 100.0
Slave
  Avg. executing (total exc. idle)        114.7 / 100.0   59.0 / 96.3    33.6 / 91.5    22.3 / 93.7    130.9 / 92.8
    Identify inconsistencies and repairs  74.7 / 65.1     38.5 / 62.8    23.2 / 63.4    17.2 / 72.2    72.4 / 51.3
    Repair corpus                         40.0 / 34.8     20.5 / 33.5    10.3 / 28.1    5.1 / 21.5     58.5 / 41.5
  Avg. idle                               0.0 / 0.0       2.3 / 3.7      3.1 / 8.5      1.5 / 6.3      10.2 / 7.2
    Waiting for peers                     0.0 / 0.0       2.2 / 3.7      3.1 / 8.4      1.5 / 6.2      10.1 / 7.2
    Waiting for master                    0.0 / 0.0       0.1 / 0.1      0.0 / 0.0      0.0 / 0.0      0.1 / 0.1

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines, and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence-class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and thus we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent on the slave machine. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection.

Manual      Auto        Count   %
SAME        *           223     44.3
DIFFERENT   *           261     51.9
UNCLEAR     –           19      3.8
*           SAME        361     74.6
*           DIFFERENT   123     25.4
SAME        SAME        205     91.9
SAME        DIFFERENT   18      8.1
DIFFERENT   DIFFERENT   105     40.2
DIFFERENT   SAME        156     59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

#   Functional property                 Detections
1   foaf:gender                         60
2   foaf:age                            32
3   dbo:height, dbo:length              4
4   dbo:wheelbase, dbo:width            3
5   atomowl:body                        1
6   loc:address, loc:name, loc:phone    1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

#   Disjoint class 1      Disjoint class 2   Detections
1   foaf:Document         foaf:Person        163
2   foaf:Document         foaf:Agent         153
3   foaf:Organization     foaf:Person        17
4   foaf:Person           foaf:Project       3
5   foaf:Organization     foaf:Project       1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions; y-axis: number of repaired equivalence classes.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition; y-axis: number of partitions.



From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions, and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded; the results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

44. We were particularly careful to distinguish information resources, such as Wikipedia articles, and non-information resources as being DIFFERENT.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
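These figures follow directly from the counts in Table 20; as a check (our own arithmetic, in LaTeX notation):

  \mathrm{precision}_{\mathrm{same}} = \tfrac{205}{361} \approx 0.568, \quad
  \mathrm{recall}_{\mathrm{same}} = \tfrac{205}{223} \approx 0.919, \quad
  F_1 = \tfrac{2 \cdot 0.568 \cdot 0.919}{0.568 + 0.919} \approx 0.702

  \mathrm{precision}_{\mathrm{diff}} = \tfrac{105}{123} \approx 0.854, \quad
  \mathrm{recall}_{\mathrm{diff}} = \tfrac{105}{261} \approx 0.402, \quad
  F_1 = \tfrac{2 \cdot 0.854 \cdot 0.402}{0.854 + 0.402} \approx 0.546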

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different, but which were re-consolidated by the repair, were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45. Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external sorts, characterised by the linearithmic time complexity O(n log(n)); external sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

46. We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence-class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criterion for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values; (ii) consider authority for terminological data [31,33]; and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

47. Although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural-language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching; for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

48. OAEI: http://oaei.ontologymatching.org/
49. Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to the candidate-reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities by counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; and Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

50. In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.


9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (especially scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash-up results in their engine; compared to our system, they do not prioritise precision, where mash-ups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51. From personal communication with the Sindice team.
52. http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation approach as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPie system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53. http://www.rkbexplorer.com/sameAs/
54. http://sameas.org/
55. http://ws.nju.edu.cn/objectcoref/

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

56. Again, our inspection results are available at http://swse.deri.org/entity

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data-warehousing applications: we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box" prefixes
atomowl       http://bblfish.net/work/atom-owl/2006-06-06#
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb#
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box" prefixes
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/#
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://www.data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission, 2008. http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial, 2008. Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings, 2006.


[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccá (Eds.), SEBD 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop, 2008.

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondřej Sváb-Zamazal, Vojtěch Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009, 2009.

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft, 2008. http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in linked data, in: International Semantic Web Conference (1), 2010, pp. 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010), 2010.

[28] Patrick Hayes, RDF Semantics, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003.

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT, 2007, pp. 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren Spyros Kotoulas George Anadiotis Ronny Siebes Annette ten TeijeFrank van Harmelen Marvin distributed reasoning over large-scale SemanticWeb data J Web Sem 7 (4) (2009) 305ndash316

[52] Niko Popitsch Bernhard Haslhofer DSNotify handling broken links in the webof data in WWW (2010) 761ndash770

[53] Eric Prudrsquohommeaux Andy Seaborne SPARQL Query Language for RDF W3CRecommendation (2008) httpwwww3orgTRrdf-sparql-query

[54] Yves Raimond Christopher Sutton Mark B Sandler Interlinking Music-Related Data on the Web IEEE MultiMedia 16 (2) (2009) 52ndash63

[55] Raymond Reiter A Theory of Diagnosis from First Principles Artif Intell 32 (1)(1987) 57ndash95

[56] Fatiha Saıs Nathalie Pernelle Marie-Christine Rousset Combining a logicaland a numerical method for data reconciliation J Data Sem 12 (2009)66ndash94

[57] Manuel Salvadores Gianluca Correndo Bene Rodriguez-Castro NicholasGibbins John Darlington Nigel R Shadbolt LinksB2N automatic dataintegration for the Semantic Web in OTM Conferences (2) 2009 pp 1121ndash1138

[58] F Scharffe Y Liu C Zhou RDF-AI an architecture for RDF datasets matchingfusion and interlink in IJCAI 2009 Workshop on Identity Reference andKnowledge Representation (IR-KR) (2009)

[59] Lian Shi Diego Berrueta Sergio Fernaacutendez Luis Polo Silvino FernaacutendezSmushing RDF instances are Alice and Bob the same open source developerin PICKME Workshop (2008)

[60] Sameer Singh Michael L Wick Andrew McCallum Distantly labeling data forlarge scale cross-document coreference CoRR (2010) abs10054298

[61] Jennifer Sleeman Tim Finin Learning co-reference relations for FOAFinstances in Poster and Demo Session at ISWC (2010)

[62] Dezhao Song Jeff Heflin Automatically generating data linkages using adomain-independent candidate selection approach in International SemanticWeb Conference (2011)

[63] Mechthild Stoer Frank Wagner A simple min-cut algorithm J ACM 44 (4)(1997) 585ndash591

[64] Michael Stonebraker The Case for Shared Nothing IEEE Database Eng Bull 9(1) (1986) 4ndash9

[65] Heiner Stuckenschmidt A semantic similarity measure for ontology-basedinformation in FQAS rsquo09 Proceedings of the Eighth International Conferenceon Flexible Query Answering Systems Springer-Verlag Berlin Heidelberg2009 pp 406ndash417

[66] Robert Endre Tarjan A class of algorithms which require nonlinear time tomaintain disjoint sets J Comput Syst Sci 18 (2) (1979) 110ndash127

[67] Herman J ter Horst Completeness decidability and complexity of entailmentfor RDF Schema and a semantic extension involving the OWL vocabulary JWeb Sem 3 (2005) 79ndash115

[68] Andreas Thor Erhard Rahm Moma ndash a mapping-based object matchingsystem in CIDR (2007) 247ndash258

[69] Giovanni Tummarello Richard Cyganiak Michele Catasta Szymon DanielczykStefan Decker Sigma live views on the Web of data in Semantic WebChallenge (ISWC2009) (2009)

[70] Jacopo Urbani Spyros Kotoulas Jason Maassen Frank van Harmelen Henri EBal OWL reasoning with WebPIE calculating the closure of 100 billion triplesin ESWC (1) 2010 pp 213ndash227

[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


where we underline the terminological pattern. To illustrate, we take an example from our corpus:

Terminological [http://dbpedia.org/data3/length.rdf]:
dpo:length rdf:type owl:FunctionalProperty .

Assertional [http://dbpedia.org/data/Fiat_Nuova_500.xml]:
dbpedia:Fiat_Nuova_500′ dpo:length "3.546"^^xsd:double .

Assertional [http://dbpedia.org/data/Fiat_500.xml]:
dbpedia:Fiat_Nuova′ dpo:length "2.97"^^xsd:double .

where we use the prime symbol [′] to denote identifiers considered coreferent by consolidation. Here we see two very closely related models of cars consolidated in the previous step, but we now identify that they have two different values for dpo:length (a functional-property), and thus the consolidation raises an inconsistency.

Note that we do not expect the owl:differentFrom assertion to be materialised, but instead intend a rather more relaxed semantics based on a heuristic comparison: given two (distinct) literal bindings for l1 and l2, we flag an inconsistency iff (i) the data values of the two bindings are not equal (standard OWL semantics), and (ii) their lower-case string values (minus language-tags and datatypes) are lexically unequal. In particular, the relaxation is inspired by the definition of the FOAF (datatype) functional properties foaf:age, foaf:gender and foaf:birthday, where the range of these properties is rather loosely defined: a generic range of rdfs:Literal is formally given for these properties, with informal recommendations to use male/female as gender values and MM-DD syntax for birthdays, but without recommendations for datatypes or language-tags. The latter relaxation means that we would not flag an inconsistency in the following data:

Terminological [http://xmlns.com/foaf/spec/index.rdf]:
foaf:Person owl:disjointWith foaf:Document .

Assertional [fictional]:
ex:Ted foaf:age 25 .
ex:Ted foaf:age "25" .
ex:Ted foaf:gender "male" .
ex:Ted foaf:gender "Male"@en .
ex:Ted foaf:birthday "25-05"^^xsd:gMonthDay .
ex:Ted foaf:birthday "25-05" .
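The following is a minimal sketch (not the authors' code) of the relaxed literal-comparison heuristic just described, assuming literals are represented as (lexical form, language tag, datatype) tuples; the names Literal, data_value and clash are purely illustrative.

    from typing import NamedTuple, Optional

    XSD = "http://www.w3.org/2001/XMLSchema#"
    NUMERIC = {XSD + t for t in ("double", "float", "decimal", "integer", "int", "long")}

    class Literal(NamedTuple):
        lexical: str                   # e.g., "25", "Male", "3.546"
        lang: Optional[str] = None     # e.g., "en"
        datatype: Optional[str] = None # e.g., XSD + "double"

    def data_value(lit: Literal):
        """Best-effort typed value; None if no typed interpretation is attempted."""
        if lit.datatype in NUMERIC:
            try:
                return float(lit.lexical)
            except ValueError:
                return None
        return None

    def clash(l1: Literal, l2: Literal) -> bool:
        """Flag an inconsistency for a (datatype) functional property iff
        (i) the data values differ (standard OWL semantics), and
        (ii) the lower-cased lexical forms, ignoring language tags and
        datatypes, also differ."""
        v1, v2 = data_value(l1), data_value(l2)
        if v1 is not None and v2 is not None and v1 == v2:
            return False
        return l1.lexical.lower() != l2.lexical.lower()

    # clash(Literal("male"), Literal("Male", lang="en"))                      -> False (not flagged)
    # clash(Literal("25-05", datatype=XSD + "gMonthDay"), Literal("25-05"))   -> False (not flagged)
    # clash(Literal("3.546", datatype=XSD + "double"),
    #       Literal("2.97", datatype=XSD + "double"))                         -> True  (flagged)

Under this sketch, the fictional data above yields no clash, while the two Fiat lengths do, matching the intended behaviour.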

With respect to these consistency checking rules, we consider the terminological data to be sound.38 We again only consider terminological axioms that are authoritatively served by their source; for example, the following statement:

sioc:User owl:disjointWith foaf:Person .

would have to be served by a document that either sioc:User or foaf:Person dereferences to (either the FOAF or SIOC vocabulary, since the axiom applies over a combination of FOAF and SIOC assertional data).

Given a grounding for such an inconsistency-detection rule, we wish to analyse the constants bound by variables in join positions to determine whether or not the contradiction is caused by consolidation: we are thus only interested in join variables that appear at least once in a consolidatable position (thus we do not support dt-not-type) and where the join variable is "intra-assertional" (exists twice in the assertional patterns). Other forms of inconsistency must otherwise be present in the raw data.

38 In any case, we always source terminological data from the raw, unconsolidated corpus.

Further, note that owl:sameAs patterns (particularly in rule eq-diff1) are implicit in the consolidated data; e.g., consider:

Assertional [http://www.wikier.org/foaf.rdf]:
wikier:wikier′ owl:differentFrom eswc2006p:sergio-fernandez′ .

where an inconsistency is implicitly given by the owl:sameAs relation that holds between the consolidated identifiers wikier:wikier′ and eswc2006p:sergio-fernandez′. In this example, there are two Semantic Web researchers, respectively named "Sergio Fernández"39 and "Sergio Fernández Anzuola"40, who both participated in the ESWC 2006 conference, and who were subsequently conflated in the "DogFood" export.41 The former Sergio subsequently added a counter-claim in his FOAF file, asserting the above owl:differentFrom statement.

39 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/f/Fern=aacute=ndez:Sergio.html
40 http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/a/Anzuola:Sergio_Fern=aacute=ndez.html

Other inconsistencies do not involve explicit owl:sameAs patterns, a subset of which may require "positive" reasoning to be detected; e.g.:

Terminological [http://xmlns.com/foaf/spec/]:
foaf:Person owl:disjointWith foaf:Organization .
foaf:knows rdfs:domain foaf:Person .

Assertional [http://identi.ca/w3c/foaf]:
identica:48404′ foaf:knows identica:45563 .

Assertional [inferred by prp-dom]:
identica:48404′ a foaf:Person .

Assertional [http://www.data.semanticweb.org/organization/w3c/rdf]:
semweborg:w3c′ a foaf:Organization .

where the two entities are initially consolidated due to sharing the value http://www.w3.org/ for the inverse-functional property foaf:homepage; the W3C is stated to be a foaf:Organization in one document, and is inferred to be a person from its identi.ca profile through rule prp-dom; finally, the W3C is a member of two disjoint classes, forming an inconsistency detectable by rule cax-dw.42

Once the inconsistencies caused by consolidation have been identified, we need to perform a repair of the equivalence class involved. In order to resolve inconsistencies, we make three simplifying assumptions:

(i) the steps involved in the consolidation can be rederived with knowledge of the direct inlinks and outlinks of the consolidated entity, or reasoned knowledge derived therefrom;

(ii) inconsistencies are caused by pairs of consolidated identifiers;

(iii) we repair individual equivalence classes and do not consider the case where repairing one such class may indirectly repair another.

41 http://www.data.semanticweb.org/dumps/conferences/eswc-2006-complete.rdf
42 Note that this also could be viewed as a counter-example for using inconsistencies to recant consolidation, where arguably the two entities are coreferent from a practical perspective, even if "incompatible" from a symbolic perspective.


With respect to the first item, our current implementation performs a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers, what we term incompatible identifiers, such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:
made owl:propertyDisjointWith maker .

Assertional [fictional]:
ex:AZ′ maker ex:entcons′ .
dblp:Antoine_Zimmermann′ made dblp:HoganZUPD15′ .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp), and (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:
(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

43 We note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55].

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → N × R is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject; loosely, d counts the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics. The second component, c, is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, ..., Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq^0, each containing an individual identifier in the equivalence class. Now, let Ω(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of the entities in Ei and Ej; intuitively, this is the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eq^n such that there is no {a, b} ∈ C with a ∈ Ei and b ∈ Ej (i.e., for each pair of consistently mergeable subsets), identify the weights Ω(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eq^(n+1) at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers.43
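To make the procedure concrete, the following is a condensed sketch (not the authors' implementation) of the greedy consistent clustering, assuming the set of incompatible pairs, the per-edge evidence sets and a concurrence tie-breaker are supplied; it also assumes that only cluster pairs with at least one direct evidence are candidates for merging. All names are illustrative.

    from itertools import combinations

    def repair_equivalence_class(identifiers, incompatible, evidence, concurrence):
        """identifiers:  members of the (inconsistent) equivalence class Eq
           incompatible: set of frozenset({a, b}) pairs that must be separated
           evidence:     dict frozenset({a, b}) -> set of hashes of the input
                         triple sets that directly justify a ~ b
           concurrence:  function over two clusters, used only to break ties"""
        clusters = [frozenset([i]) for i in identifiers]          # Eq^0: singletons

        def mergeable(x, y):
            # consistently mergeable: no incompatible pair straddles x and y
            return not any(frozenset([a, b]) in incompatible for a in x for b in y)

        def weight(x, y):
            # (d, c): d = distinct direct evidences between the two clusters
            d = len(set().union(*(evidence.get(frozenset([a, b]), set())
                                  for a in x for b in y)))
            return (d, concurrence(x, y))

        changed = True
        while changed:                                            # iterate to fixpoint
            changed = False
            pairs = [(weight(x, y), x, y) for x, y in combinations(clusters, 2)
                     if mergeable(x, y)]
            pairs = [p for p in pairs if p[0][0] > 0]             # require some evidence
            pairs.sort(key=lambda p: p[0], reverse=True)          # lexicographic (d, c)
            merged = set()
            for _, x, y in pairs:
                if x in merged or y in merged:
                    continue                                      # one merge per cluster per round
                clusters.remove(x); clusters.remove(y); clusters.append(x | y)
                merged.update((x, y))
                changed = True
        return clusters

The returned clusters are the repaired, mutually consistent equivalence classes for the input class.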

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks, and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all of the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s and o over the sorted stream of data, and batch consolidated segments of data.
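A brief sketch (not the authors' code) of this merge-join batching follows, assuming the two inputs are iterators of sextuples already sorted by their first element (the canonical subject and canonical object, respectively); the function name and field layout are illustrative.

    from itertools import groupby
    from operator import itemgetter

    def merge_join_batches(by_subject, by_object):
        """by_subject: sextuples (s, p, o, c, s0, o0) sorted by canonical s;
           by_object:  sextuples (o, p, s, c, o0, s0) sorted by canonical o.
           Yields one (canonical id, outlinks, inlinks) batch per consolidated entity."""
        outs = groupby(by_subject, key=itemgetter(0))
        ins = groupby(by_object, key=itemgetter(0))
        ok, og = next(outs, (None, iter(())))
        ik, ig = next(ins, (None, iter(())))
        while ok is not None or ik is not None:
            # advance on the smallest pending canonical identifier
            if ik is None or (ok is not None and ok <= ik):
                key = ok
            else:
                key = ik
            outlinks = list(og) if ok == key else []
            inlinks = list(ig) if ik == key else []
            yield key, outlinks, inlinks
            if ok == key:
                ok, og = next(outs, (None, iter(())))
            if ik == key:
                ik, ig = next(ins, (None, iter(())))

    # Each batch holds everything needed to rederive, diagnose and, if necessary,
    # repair a single consolidated entity before moving on to the next.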

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated. If this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and to the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and reinfer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.
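A small complementary sketch (under the same hedged assumptions as above) of how the evidence-labelled graph is built and how two nodes are merged; direct_evidences and its field layout are assumptions for illustration only.

    from collections import defaultdict

    def build_equivalence_graph(direct_evidences):
        """direct_evidences: iterable of (id1, id2, triples), where triples is the set
           of input triples allowing the direct derivation id1 ~ id2 (an asserted
           owl:sameAs, or a shared inverse-functional object / functional subject)."""
        edges = defaultdict(set)                      # frozenset({a, b}) -> triple hashes
        for a, b, triples in direct_evidences:
            if a == b:
                continue
            edges[frozenset((a, b))].update(hash(t) for t in triples)
        return edges                                  # |hash set| = primary edge weight d

    def merge_nodes(edges, survivor, absorbed):
        """Merge one node into another: unique edges are kept as-is, while the hash
           sets of overlapping edges are unioned so no evidence is counted twice."""
        merged = defaultdict(set)
        for pair, hashes in edges.items():
            a, b = tuple(pair)
            a = survivor if a == absorbed else a
            b = survivor if b == absorbed else b
            if a != b:                                # drop the now-internal edge
                merged[frozenset((a, b))].update(hashes)
        return merged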

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of the terminological data and predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks; for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus we are left with two steps:

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

Table 17
Breakdown of timing of distributed disambiguation and repair (Min = minutes; % = share of total execution time; columns give 1, 2, 4 and 8 slave machines over the sample corpus, and the full corpus over 8 machines).

Category                                    1              2              4              8              Full-8
                                            Min     %      Min     %      Min     %      Min     %      Min     %
Master
  Total execution time                      114.7   100.0  61.3    100.0  36.7    100.0  23.8    100.0  1411    100.0
  Executing                                 –       –      –       –      –       –      –       –      –       –
  Miscellaneous                             –       –      –       –      –       –      –       –      –       –
  Idle (waiting for slaves)                 114.7   100.0  61.2    100.0  36.6    99.9   23.8    99.9   1411    100.0
Slave
  Avg. Executing (total exc. idle)          114.7   100.0  59.0    96.3   33.6    91.5   22.3    93.7   1309    92.8
    Identify inconsistencies and repairs    74.7    65.1   38.5    62.8   23.2    63.4   17.2    72.2   724     51.3
    Repair corpus                           40.0    34.8   20.5    33.5   10.3    28.1   5.1     21.5   585     41.5
  Avg. Idle                                 –       –      2.3     3.7    3.1     8.5    1.5     6.3    102     7.2
    Waiting for peers                       0.0     0.0    2.2     3.7    3.1     8.4    1.5     6.2    101     7.2
    Waiting for master                      –       –      –       –      –       –      –       –      –       –

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 23.5 h. The inconsistency extraction and equivalence class repair analysis took 12 h, with a significant average idle time of 75 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair, and so must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 9.8 h, with an average idle time of 27 min.

In Table 17 we again give a detailed breakdown of the timings for the task. With regards to the full corpus, note that the aggregation of the repair information took a negligible amount of time, where only a total of one minute is spent on the slave machines waiting for the master. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards to the timings of the tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time it is load-balancing for the inconsistency detection and repair selection task that sees overall performance start to converge as machines are added.
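These factors can be read off the total execution times in Table 17:

    61.3 / 114.7 ≈ 0.534,   36.7 / 61.3 ≈ 0.599,   23.8 / 36.7 ≈ 0.649.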

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection (⁄ = any value).

Manual        Auto          Count    %
SAME          ⁄             223      44.3
DIFFERENT     ⁄             261      51.9
UNCLEAR       –             19       3.8
⁄             SAME          361      74.6
⁄             DIFFERENT     123      25.4
SAME          SAME          205      91.9
SAME          DIFFERENT     18       8.1
DIFFERENT     DIFFERENT     105      40.2
DIFFERENT     SAME          156      59.7

Table 18
Breakdown of inconsistency detections for functional-properties, where properties in the same row gave identical detections.

    Functional property                   Detections
1   foaf:gender                           60
2   foaf:age                              32
3   dbo:height, dbo:length                4
4   dbo:wheelbase, dbo:width              3
5   atomowl:body                          1
6   loc:address, loc:name, loc:phone      1

Table 19
Breakdown of inconsistency detections for disjoint-classes.

    Disjoint class 1        Disjoint class 2     Detections
1   foaf:Document           foaf:Person          163
2   foaf:Document           foaf:Agent           153
3   foaf:Organization       foaf:Person          17
4   foaf:Person             foaf:Project         3
5   foaf:Organization       foaf:Project         1

[Figure] Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.

[Figure] Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.

44 We were particularly careful to distinguish information resources (such as WIKIPEDIA articles) and non-information resources as being DIFFERENT.


From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for functional properties, and one was detected through an owl:differentFrom assertion. We list the top five functional-properties given distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that, between the time of the crawl and the time of writing, the FOAF vocabulary has removed the disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times.44 We checked our manually annotated results against the partitions produced by our repair to see if they corresponded; the results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
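These measures can be re-derived from the counts in Table 20:

    precision(SAME)      = 205 / 361 ≈ 0.568,   recall(SAME)      = 205 / 223 ≈ 0.919,
    F1(SAME)             = 2 × 0.568 × 0.919 / (0.568 + 0.919) ≈ 0.702;
    precision(DIFFERENT) = 105 / 123 ≈ 0.854,   recall(DIFFERENT) = 105 / 261 ≈ 0.402,
    F1(DIFFERENT)        = 2 × 0.854 × 0.402 / (0.854 + 0.402) ≈ 0.546.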

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different but were re-consolidated by the repair were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent; these can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner.45 Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

45 Of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29].

8. Critical discussion

In this section, we provide a critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external-sorts, characterised by the linearithmic time complexity O(n log n); external-sorts do not have a critical main-memory requirement and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines are small in volume; for the moment, we do not see this as a serious factor in scaling up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck; for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues;46

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

46 We reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom.

With the exception of our manually derived blacklist for values of (inverse-)functional-properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks, particularly the scale and diversity of the datasets, we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there are no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that, for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third-party data about an entity is somewhat fallacious without also considering the third-party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which at least can be used to verify or post-process sources.47 Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise; we leave more mature means of handling spamming for future work (as required).

47 Although it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain.

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, duplicate identification and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative48 introduced a new test track on instance matching.49 We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

48 OAEI: http://oaei.ontologymatching.org/
49 Instance data matching: http://www.instancematching.org/

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis, and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance-matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fuse step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability: the RDF datasets are loaded into memory and processed using the popular Jena framework, and the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities, and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.

Ioannou et al. [37] also propose a system called RDFSim to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
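To give a flavour of the general idea (a generic MinHash-based sketch under our own simplifying assumptions, not the actual RDFSim implementation; all names and parameters below are illustrative), candidate grouping over such virtual documents can be approximated as follows:

import hashlib
from collections import defaultdict

def shingles(text, k=4):
    # character k-shingles of a resource's virtual document
    return {text[i:i+k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(shings, num_hashes=64):
    # one MinHash value per seeded hash function
    return [min(int(hashlib.md5((str(seed) + s).encode()).hexdigest(), 16)
                for s in shings)
            for seed in range(num_hashes)]

def lsh_candidates(docs, bands=16, rows=4):
    # docs: {resource_id: virtual_document_text}; bands * rows = signature length
    buckets = defaultdict(set)
    sigs = {r: minhash_signature(shingles(t)) for r, t in docs.items()}
    for r, sig in sigs.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b*rows:(b+1)*rows]))].add(r)
    return [grp for grp in buckets.values() if len(grp) > 1]

def jaccard(a, b):
    # exact Jaccard overlap, applied only within candidate groups
    return len(a & b) / len(a | b)

Resources that land in the same bucket for at least one band become comparison candidates and can then be compared exactly (e.g., with the Jaccard coefficient above), avoiding a complete pair-wise comparison.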

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same, and thus all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22] (http://www.rkbexplorer.com/sameAs/), <sameAs> (http://sameas.org/) and ObjectCoref [15] (http://ws.nju.edu.cn/objectcoref/) offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store: the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "dead link") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging (again, our inspection results are available at http://swse.deri.org/entity).

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix  URI

"T-Box prefixes"
atomowl  http://bblfish.net/work/atom-owl/2006-06-06/
b2r  http://bio2rdf.org/bio2rdf#
b2rr  http://bio2rdf.org/bio2rdf_resource:
dbo  http://dbpedia.org/ontology/
dbp  http://dbpedia.org/property/
ecs  http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat  http://ontologycentral.com/2009/01/eurostat/ns#
fb  http://rdf.freebase.com/ns/
foaf  http://xmlns.com/foaf/0.1/
geonames  http://www.geonames.org/ontology#
kwa  http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed  http://linkedlifedata.com/resource/pubmed/
loc  http://sw.deri.org/2006/07/location/loc#
mo  http://purl.org/ontology/mo/
mvcb  http://webns.net/mvcb#
opiumfield  http://rdf.opiumfield.com/lastfm/
owl  http://www.w3.org/2002/07/owl#
quaffing  http://purl.org/net/schemas/quaffing/
rdf  http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs  http://www.w3.org/2000/01/rdf-schema#
skipinions  http://skipforward.net/skipforward/page/seeder/skipinions/
skos  http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl  http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia  http://dbpedia.org/resource/
dblpperson  http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p  http://www.eswc2006.org/people/
identicau  http://identi.ca/user/
kingdoms  http://lod.geospecies.org/kingdoms/
macs  http://stitch.cs.vu.nl/alignments/macs/
semweborg  http://data.semanticweb.org/organization/
swid  http://semanticweb.org/id/
timblfoaf  http://www.w3.org/People/Berners-Lee/card#
vperson  http://virtuoso.openlinksw.com/dataspace/person/
wikier  http://www.wikier.org/foaf.rdf#
yagor  http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission, 2008. http://www.w3.org/TeamSubmission/turtle/.
[4] Tim Berners-Lee, Linked Data, Design Issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org tutorial, 2008. Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OKKAM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, volume 201 of CEUR Workshop Proceedings, 2006.


[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-schema/.
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccá (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop, 2008.
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on The Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009, 2009.
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft, 2008. http://www.w3.org/TR/owl2-profiles/.
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in linked data, in: International Semantic Web Conference (1), 2010, pp. 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010), 2010.
[28] Patrick Hayes, RDF Semantics, W3C Recommendation, 2004. http://www.w3.org/TR/rdf-mt/.
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference, 2010.
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, 2003.
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records: computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation, 2008. http://www.w3.org/TR/rdf-sparql-query/.
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), 2009.
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop, 2008.
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC, 2010.
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference, 2011.
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009), 2009.
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference, 2009, pp. 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference, 2009, pp. 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009), 2009, pp. 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


With respect to the first item, our current implementation will be performing a repair of the equivalence class based on knowledge of direct inlinks and outlinks, available through a simple merge-join as used in the previous section; this thus precludes repair of consolidation found through rule cls-maxqc2, which also requires knowledge about the class memberships of the outlinked node. With respect to the second item, we say that inconsistencies are caused by pairs of identifiers (what we term incompatible identifiers), such that we do not consider inconsistencies caused with respect to a single identifier (inconsistencies not caused by consolidation), and do not consider the case where the alignment of more than two identifiers is required to cause a single inconsistency (not possible in our rules), where such a case would again lead to a disjunction of repair strategies. With respect to the third item, it is possible to resolve a set of inconsistent equivalence classes by repairing one; for example, consider rules with multiple "intra-assertional" join-variables (prp-irp, prp-asyp) that can have explanations involving multiple consolidated identifiers, as follows:

Terminological [fictional]:

made owl:propertyDisjointWith maker .

Assertional [fictional]:

ex:AZ maker ex:entcons .

dblp:Antoine_Zimmermann made dblp:HoganZUPD15 .

where both equivalences together constitute an inconsistency. Repairing one equivalence class would repair the inconsistency detected for both; we give no special treatment to such a case, and resolve each equivalence class independently. In any case, we find no such incidences in our corpus: these inconsistencies require (i) axioms new in OWL 2 (rules prp-irp, prp-asyp, prp-pdw and prp-adp); (ii) alignment of two consolidated sets of identifiers in the subject/object positions. Note that such cases can also occur given the recursive nature of our consolidation, whereby consolidating one set of identifiers may lead to alignments in the join positions of the consolidation rules in the next iteration; however, we did not encounter such recursion during the consolidation phase for our data.

The high-level approach to repairing inconsistent consolidation is as follows:

(i) rederive and build a non-transitive, symmetric graph of equivalences between the identifiers in the equivalence class, based on the inlinks and outlinks of the consolidated entity;

(ii) discover identifiers that together cause inconsistency and must be separated, generating a new seed equivalence class for each, and breaking the direct links between them;

(iii) assign the remaining identifiers into one of the seed equivalence classes based on:

(a) minimum distance in the non-transitive equivalence class;
(b) if tied, use a concurrence score.

Given the simplifying assumptions, we can formalise the problem thus: we denote the graph of non-transitive equivalences for a given equivalence class as a weighted graph G = (V, E, ω), such that V ⊆ B ∪ U is the set of vertices, E ⊆ (B ∪ U) × (B ∪ U) is the set of edges, and ω : E → ℕ × ℝ is a weighting function for the edges. Our edge weights are pairs (d, c), where d is the number of sets of input triples in the corpus that allow to directly derive the given equivalence relation by means of a direct owl:sameAs assertion (in either direction), or a shared inverse-functional object or functional subject: loosely, the independent evidences for the relation given by the input graph, excluding transitive owl:sameAs semantics; c is the concurrence score derivable between the unconsolidated entities, and is used to resolve ties (we would expect many strongly connected equivalence graphs where, e.g., the entire equivalence class is given by a single shared value for a given inverse-functional property, and we thus require the additional granularity of concurrence for repairing the data in a non-trivial manner). We define a total lexicographical order over these pairs.
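Written out explicitly, this lexicographical comparison of two weight pairs amounts to:

\[
(d_1, c_1) > (d_2, c_2) \;\Longleftrightarrow\; d_1 > d_2 \;\lor\; \bigl(d_1 = d_2 \land c_1 > c_2\bigr),
\]

i.e., the count of direct evidences dominates, with the concurrence score only breaking ties.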

Given an equivalence class Eq ⊆ U ∪ B that we perceive to cause a novel inconsistency (i.e., an inconsistency derivable by the alignment of incompatible identifiers), we first derive a collection of sets C = {C1, …, Cn}, C ⊆ 2^(U∪B), such that each Ci ∈ C, Ci ⊆ Eq, denotes an unordered pair of incompatible identifiers.

We then apply a straightforward greedy consistent clustering of the equivalence class, loosely following the notion of a minimal cutting (see, e.g., [63]). For Eq, we create an initial set of singleton sets Eq0, each containing an individual identifier in the equivalence class. Now let U(Ei, Ej) denote the aggregated weight of the edge considering the merge of the nodes of Ei and the nodes of Ej in the graph: the pair (d, c) such that d denotes the unique evidences for equivalence relations between all nodes in Ei and all nodes in Ej, and such that c denotes the concurrence score considering the merge of entities in Ei and Ej; intuitively, this is the same weight as before, but applied as if the identifiers in Ei and Ej were consolidated in the graph. We can apply the following clustering:

– for each pair of sets Ei, Ej ∈ Eqn such that there is no {a, b} ∈ C with a ∈ Ei and b ∈ Ej (i.e., consistently mergeable subsets), identify the weights U(Ei, Ej) and order the pairings;

– in descending order with respect to the above weights, merge Ei, Ej pairs (such that neither Ei nor Ej have already been merged in this iteration), producing Eqn+1 at iteration's end;

– iterate over n until fixpoint.

Thus, we determine the pairs of incompatible identifiers that must necessarily be in different repaired equivalence classes, deconstruct the equivalence class, and then begin reconstructing the repaired equivalence class by iteratively merging the most strongly linked intermediary equivalence classes that will not contain incompatible identifiers (we note the possibility of a dual correspondence between our "bottom-up" approach to repair and the "top-down" minimal hitting set techniques introduced by Reiter [55]).
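A minimal sketch of this greedy clustering is given below (in Python, with the weight aggregation U and the concurrence tie-breaker stubbed out as an assumed weight function; all names are illustrative rather than taken from our implementation):

def repair(ids, incompatible, weight):
    # ids: identifiers in the inconsistent equivalence class
    # incompatible: set of frozensets {a, b} that must be kept apart
    # weight(Ei, Ej): aggregated (d, c) pair for merging partitions Ei and Ej
    parts = [frozenset([i]) for i in ids]          # singleton seed partitions
    while True:
        # rank consistently-mergeable pairs by descending (d, c), lexicographically
        pairs = sorted(
            ((weight(ei, ej), ei, ej)
             for x, ei in enumerate(parts) for ej in parts[x+1:]
             if not any(frozenset([a, b]) in incompatible for a in ei for b in ej)),
            key=lambda t: t[0], reverse=True)
        merged, used, nxt = False, set(), []
        for w, ei, ej in pairs:
            if ei not in used and ej not in used:  # merge each partition at most once per iteration
                used.update([ei, ej])
                nxt.append(ei | ej)
                merged = True
        nxt.extend(p for p in parts if p not in used)
        parts = nxt
        if not merged:                             # fixpoint: no consistent merge remains
            return parts

At fixpoint, the returned partitions are pairwise consistent: no partition contains two incompatible identifiers, and no further merge is possible without introducing such a pair.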

7.2. Implementing disambiguation

The implementation of the above disambiguation process can be viewed on two levels: the macro level, which identifies and collates the information about individual equivalence classes and their respectively consolidated inlinks/outlinks; and the micro level, which repairs individual equivalence classes.

On the macro level, the task assumes input data sorted by both subject (s, p, o, c, s′, o′) and object (o, p, s, c, o′, s′), again such that s, o represent canonical identifiers and s′, o′ represent the original identifiers, as before. Note that we also require the asserted owl:sameAs relations encoded likewise. Given that all the required information about the equivalence classes (their inlinks, outlinks, derivable equivalences and original identifiers) is gathered under the canonical identifiers, we can apply a straightforward merge-join on s–o over the sorted stream of data, and batch consolidated segments of data.
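A minimal sketch of this merge-join batching (assuming simplified sextuple layouts, that both inputs follow the same sort order over identifiers, and with all names illustrative) is:

import itertools

def consolidated_segments(by_subject, by_object):
    # by_subject: sextuples (s, p, o, c, s', o') sorted by canonical subject s
    # by_object:  sextuples (o, p, s, c, o', s') sorted by canonical object o
    # (identifiers appearing only in the object position are skipped for brevity)
    obj_groups = itertools.groupby(by_object, key=lambda t: t[0])
    o_key, o_group = next(obj_groups, (None, iter(())))
    for s_key, s_group in itertools.groupby(by_subject, key=lambda t: t[0]):
        # advance the object-sorted stream until it catches up with s_key
        while o_key is not None and o_key < s_key:
            o_key, o_group = next(obj_groups, (None, iter(())))
        inlinks = list(o_group) if o_key == s_key else []
        yield s_key, list(s_group), inlinks        # one consolidated segment

Each yielded segment holds all outlinks and inlinks for one canonical identifier, and can be buffered and repaired in isolation, as described next.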

On a micro level, we buffer each individual consolidated segment into an in-memory index; currently, these segments fit in memory, where, for the largest equivalence classes, we note that inlinks/outlinks are commonly duplicated: if this were not the case, one could consider using an on-disk index, which should be feasible given that only small batches of the corpus are under analysis at each given time. We assume access to the relevant terminological knowledge required for reasoning, and the predicate-level statistics derived during the concurrence analysis. We apply scan-reasoning and inconsistency detection over each batch, and, for efficiency, skip over batches that are not symptomised by incompatible identifiers.

For equivalence classes containing incompatible identifiers, we first determine the full set of such pairs through application of the inconsistency detection rules: usually, each detection gives a single pair, where we ignore pairs containing the same identifier (i.e., detections that would equally apply over the unconsolidated data). We check the pairs for a trivial solution: if all identifiers in the equivalence class appear in some pair, we check whether the graph formed by the pairs is strongly connected, in which case the equivalence class must necessarily be completely disbanded.

For non-trivial repairs, we extract the explicit owl:sameAs relations (which we view as directionless) and reinfer owl:sameAs relations from the consolidation rules, encoding the subsequent graph. We label edges in the graph with a set of hashes denoting the input triples required for their derivation, such that the cardinality of the hashset corresponds to the primary edge weight. We subsequently use a priority-queue to order the edge-weights, and only materialise concurrence scores in the case of a tie. Nodes in the equivalence graph are merged by combining unique edges and merging the hashsets for overlapping edges. Using these operations, we can apply the aforementioned process to derive the final repaired equivalence classes.

In the final step, we encode the repaired equivalence classes in memory and perform a final scan of the corpus (in natural sorted order), revising identifiers according to their repaired canonical term.
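A sketch of that closing scan (again with assumed, simplified structures) is little more than a dictionary lookup per position:

def rewrite(quads, repaired_canon):
    # repaired_canon: {identifier -> repaired canonical identifier};
    # identifiers absent from the map keep their current canonical term
    for s, p, o, c in quads:
        yield (repaired_canon.get(s, s), p, repaired_canon.get(o, o), c)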

7.3. Distributed implementation

Distribution of the task becomes straightforward, assuming that the slave machines have knowledge of terminological data, predicate-level statistics, and already have the consolidation-encoding sextuples sorted and coordinated by hash on s and o. Note that all of these data are present on the slave machines from previous tasks: for the concurrence analysis, we in fact maintain sextuples during the data preparation phase (although not required by the analysis).

Thus, we are left with two steps:

Table 17
Breakdown of timing of distributed disambiguation and repair.

– Run: each slave machine performs the above process on its segment of the corpus, applying a merge-join over the data sorted by (s, p, o, c, s′, o′) and (o, p, s, c, o′, s′) to derive batches of consolidated data, which are subsequently analysed, diagnosed, and a repair derived in memory.

– Gather/run: the master machine gathers all repair information from all slave machines and floods the merged repairs to the slave machines; the slave machines subsequently perform the final repair of the corpus.

7.4. Performance evaluation

We ran the inconsistency detection and disambiguation over the corpora produced by the extended consolidation approach. For the full corpus, the total time taken was 2.35 h. The inconsistency extraction and equivalence class repair analysis took 1.2 h, with a significant average idle time of 7.5 min (9.3%): in particular, certain large batches of consolidated data took significant amounts of time to process, particularly to reason over. The additional expense is due to the relaxation of duplicate detection: we cannot consider duplicates on a triple level, but must consider uniqueness based on the entire sextuple to derive the information required for repair; thus, we must apply many duplicate inferencing steps. On the other hand, we can skip certain reasoning paths that cannot lead to inconsistency. Repairing the corpus took 0.98 h, with an average idle time of 2.7 min.

In Table 17, we again give a detailed breakdown of the timings for the task. With regards the full corpus, note that the aggregation of the repair information took a negligible amount of time, and only a total of one minute is spent on the slave machines. Most notably, load-balancing is somewhat of an issue, causing slave machines to be idle for, on average, 7.2% of the total task time, mostly waiting for peers. This percentage, and the associated load-balancing issues, would likely be aggravated further by more machines or a higher scale of data.

With regards the timings of tasks when varying the number of slave machines: as the number of slave machines doubles, the total execution times decrease by factors of 0.534, 0.599 and 0.649, respectively. Unlike for previous tasks, where the aggregation of global knowledge on the master machine poses a bottleneck, this time load-balancing for the inconsistency detection and repair selection task sees overall performance start to converge when increasing machines.

7.5. Results evaluation

It seems that our discussion of inconsistency repair has been somewhat academic.


Table 20
Results of manual repair inspection.

Manual       Auto        Count (%)
SAME         ⁄           223 (44.3)
DIFFERENT    ⁄           261 (51.9)
UNCLEAR      –           19 (3.8)
⁄            SAME        361 (74.6)
⁄            DIFFERENT   123 (25.4)
SAME         SAME        205 (91.9)
SAME         DIFFERENT   18 (8.1)
DIFFERENT    DIFFERENT   105 (40.2)
DIFFERENT    SAME        156 (59.7)

Table 18
Breakdown of inconsistency detections for functional properties, where properties in the same row gave identical detections.

   Functional property                    Detections
1  foaf:gender                            60
2  foaf:age                               32
3  dbo:height, dbo:length                 4
4  dbo:wheelbase, dbo:width               3
5  atomowl:body                           1
6  loc:address, loc:name, loc:phone       1

Table 19
Breakdown of inconsistency detections for disjoint classes.

   Disjoint class 1      Disjoint class 2   Detections
1  foaf:Document         foaf:Person        163
2  foaf:Document         foaf:Agent         153
3  foaf:Organization     foaf:Person        17
4  foaf:Person           foaf:Project       3
5  foaf:Organization     foaf:Project       1

Fig. 10. Distribution of the number of partitions the inconsistent equivalence classes are broken into (log/log); x-axis: number of partitions, y-axis: number of repaired equivalence classes.

Fig. 11. Distribution of the number of identifiers per repaired partition (log/log); x-axis: number of IDs in partition, y-axis: number of partitions.


From the total of 2.82 million consolidated batches to check, we found 280 equivalence classes (0.01%), containing 3,985 identifiers, causing novel inconsistency. Of these 280 inconsistent sets, 185 were detected through disjoint-class constraints, 94 were detected through distinct literal values for inverse-functional properties, and one was detected through owl:differentFrom assertions. We list the top five functional properties given non-distinct literal values in Table 18, and the top five disjoint classes in Table 19; note that some inconsistent equivalence classes had multiple detections, where, e.g., the class foaf:Person is a subclass of foaf:Agent and thus an identical detection is given for each. Notably, almost all of the detections involve FOAF classes or properties. (Further, we note that between the time of the crawl and the time of writing, the FOAF vocabulary has removed disjointness constraints between the foaf:Document and foaf:Person/foaf:Agent classes.)

In terms of repairing these 280 equivalence classes, 29 (10.4%) had to be completely disbanded, since all identifiers were pairwise incompatible. A total of 905 partitions were created during the repair, with each of the 280 classes being broken up into an average of 3.23 repaired, consistent partitions. Fig. 10 gives the distribution of the number of repaired partitions created for each equivalence class: 230 classes were broken into the minimum of two partitions, and one class was broken into 96 partitions. Fig. 11 gives the distribution of the number of identifiers in each repaired partition: each partition contained an average of 4.4 equivalent identifiers, with 577 identifiers in singleton partitions and one partition containing 182 identifiers.

Finally, although the repaired partitions no longer cause inconsistency, this does not necessarily mean that they are correct. From the raw equivalence classes identified to cause inconsistency, we applied the same sampling technique as before: we extracted 503 identifiers and, for each, randomly sampled an identifier originally thought to be equivalent. We then manually checked whether or not we considered the identifiers to refer to the same entity. In the (blind) manual evaluation, we selected option UNCLEAR 19 times, option SAME 232 times (of which 49 were TRIVIALLY SAME), and option DIFFERENT 249 times (we were particularly careful to distinguish information resources, such as WIKIPEDIA articles, and non-information resources as being DIFFERENT). We checked our manually annotated results against the partitions produced by our repair to see if they corresponded. The results are enumerated in Table 20. Of the 481 (clear) manually-inspected pairs, 361 (72.1%) remained equivalent after the repair and 123 (25.4%) were separated. Of the 223 pairs manually annotated as SAME, 205 (91.9%) also remained equivalent in the repair, whereas 18 (8.1%) were deemed different; here we see that the repair has a 91.9% recall when re-establishing equivalence for resources that are the same. Of the 261 pairs verified as DIFFERENT, 105 (40.2%) were also deemed different in the repair, whereas 156 (59.7%) were deemed to be equivalent.

Overall, for identifying which resources were the same and should be re-aligned during the repair, the precision was 0.568, with a recall of 0.919 and F1-measure of 0.702. Conversely, for identifying which resources were different and should be kept separate during the repair, the precision was 0.854, with a recall of 0.402 and F1-measure of 0.546.
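For concreteness, the figures for the re-aligned (SAME) case follow directly from the counts in Table 20:

\[
P = \frac{205}{205 + 156} \approx 0.568, \qquad
R = \frac{205}{205 + 18} \approx 0.919, \qquad
F_1 = \frac{2PR}{P + R} \approx 0.702.
\]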

Interpreting and inspecting the results, we found that the inconsistency-based repair process works well for correctly fixing equivalence classes with one or two "bad apples". However, for equivalence classes which contain many broken resources, we found that the repair process tends towards re-aligning too many identifiers: although the output partitions are consistent, they are not correct. For example, we found that 30 of the pairs which were annotated as different but were re-consolidated by the repair were from the opera.com domain, where users had specified various nonsense values which were exported as values for inverse-functional chat-ID properties in FOAF (and were missed by our black-list). This led to groups of users (the largest containing 205 users) being initially identified as being co-referent through a web of sharing different such values. In these groups, inconsistency was caused by different values for gender (a functional property), and so they were passed into the repair process. However, the repair process simply split the groups into male and female partitions, which, although now consistent, were still incorrect. This illustrates a core weakness of the proposed repair process: consistency does not imply correctness.

In summary, for an inconsistency-based detection and repair process to work well over Linked Data, vocabulary publishers would need to provide more sensible axioms which indicate when instance data are inconsistent. These can then be used to detect more examples of incorrect equivalence (amongst other use-cases [9]). To help repair such cases in particular, the widespread use and adoption of datatype-properties which are both inverse-functional and functional (i.e., one-to-one properties, such as isbn) would greatly help to automatically identify not only which resources are the same, but also which resources are different, in a granular manner (of course, in OWL (2) DL, datatype-properties cannot be inverse-functional, but Linked Data vocabularies often break this restriction [29]). Functional properties (e.g., gender) or disjoint classes (e.g., Person, Document) which have wide catchments are useful to detect inconsistencies in the first place, but are not ideal for performing granular repairs.

8. Critical discussion

In this section, we provide critical discussion of our approach, following the dimensions of the requirements listed at the outset.

With respect to scale, on a high level, our primary means of organising the bulk of the corpus is external-sorts, characterised by the linearithmic time complexity O(n log n); external-sorts do not have a critical main-memory requirement, and are efficiently distributable. Our primary means of accessing the data is via linear scans. With respect to the individual tasks:

– our current baseline consolidation approach relies on an in-memory owl:sameAs index; however, we demonstrate an on-disk variant in the extended consolidation approach;

– the extended consolidation currently loads terminological data that is required by all machines into memory; if necessary, we claim that an on-disk terminological index would offer good performance given the distribution of class and property memberships, where we believe that a high cache-hit rate would be enjoyed;

– for the entity concurrence analysis, the predicate-level statistics required by all machines is small in volume; for the moment, we do not see this as a serious factor in scaling-up;

– for the inconsistency detection, we identify the same potential issues with respect to terminological data; also, given large equivalence classes with a high number of inlinks and outlinks, we would encounter main-memory problems, where we believe that an on-disk index could be applied, assuming a reasonable upper limit on batch sizes.

With respect to efficiency:

– the on-disk aggregation of owl:sameAs data for the extended consolidation has proven to be a bottleneck: for efficient processing at higher levels of scale, distribution of this task would be a priority, which should be feasible given that, again, the primitive operations involved are external sorts and scans, with non-critical in-memory indices to accelerate reaching the fixpoint;

– although we typically observe terminological data to constitute a small percentage of Linked Data corpora (0.1% in our corpus; cf. [31,33]), at higher scales, aggregating the terminological data for all machines may become a bottleneck, and distributed approaches to perform such would need to be investigated; similarly, as we have seen, large terminological documents can cause load-balancing issues (we reduce terminological statements on a document-by-document basis according to unaligned blank-node positions; for example, we prune RDF collections identified by blank-nodes that do not join with, e.g., an owl:unionOf axiom);

– for the concurrence analysis and inconsistency detection, data are distributed according to a modulo-hash function on the subject and object position, where we do not hash on the objects of rdf:type triples; although we demonstrated even data distribution by this approach for our current corpus, this may not hold in the general case;

– as we have already seen for our corpus and machine count, the complexity of repairing consolidated batches may become an issue given large equivalence class sizes;

– there is some notable idle time for our machines, where the total cost of running the pipeline could be reduced by interleaving jobs.

With the exception of our manually derived blacklist for values of (inverse-)functional properties, the methods presented herein have been entirely domain-agnostic and fully automatic.

One major open issue is the question of precision and recall. Given the nature of the tasks (particularly the scale and diversity of the datasets), we believe that deriving an appropriate gold standard is currently infeasible:

– the scale of the corpus precludes manual or semi-automatic processes;

– any automatic process for deriving the gold standard would make redundant the approach to test;

– results derived from application of the methods on subsets of manually verified data would not be equatable to the results derived from the whole corpus;

– even assuming a manual approach were feasible, oftentimes there is no objective criteria for determining what precisely signifies what: the publisher's original intent is often ambiguous.

Thus, we prefer symbolic approaches to consolidation and disambiguation that are predicated on the formal semantics of the data, where we can appeal to the fact that incorrect consolidation is due to erroneous data, not an erroneous approach. Without a formal means of sufficiently evaluating the results, we employ statistical methods for applications where precision is not a primary requirement. We also use formal inconsistency as an indicator of imprecise consolidation, although we have shown that this method does not currently yield many detections. In general, we believe that for the corpora we target, such research can only find its real litmus test when integrated into a system with a critical user-base.

Finally, we have only briefly discussed issues relating to web-tolerance, e.g., spamming or conflicting data. With respect to such considerations, we currently (i) derive and use a blacklist for common void values, (ii) consider authority for terminological data [31,33], and (iii) try to detect erroneous consolidation through consistency verification. One might question an approach which trusts all equivalences asserted or derived from the data. Along these lines, we track the original pre-consolidation identifiers (in the form of sextuples), which can be used to revert erroneous consolidation. In fact, similar considerations can be applied more generally to the re-use of identifiers across sources: giving special consideration to the consolidation of third party data about an entity is somewhat fallacious without also considering the third party contribution of data using a consistent identifier. In both cases, we track the context of (consolidated) statements, which, at least, can be used to verify or post-process sources (although, it must be said, we currently do not track the steps used to derive the equivalences involved in consolidation, which would be expensive to materialise and maintain). Currently, the corpus we use probably does not exhibit any significant deliberate spamming, but rather indeliberate noise. We leave more mature means of handling spamming for future work (as required).

9. Related work

9.1. Related fields

Work relating to entity consolidation has been researched in the area of databases for a number of years, aiming to identify and process co-referent signifiers, with works under the titles of record linkage, record or data fusion, merge-purge, instance fusion, and duplicate identification, and (ironically) a plethora of variations thereupon; see [47,44,13,5,12], etc., and surveys at [19,8]. Unlike our approach, which leverages the declarative semantics of the data in order to be domain agnostic, such systems usually operate given closed schemas; similarly, they typically focus on string-similarity measures and statistical analysis. Haas et al. note that "in relational systems where the data model does not provide primitives for making same-as assertions [...] there is a value-based notion of identity" [25]. However, we note that some works have focused on leveraging semantics for such tasks in relational databases; e.g., Fan et al. [21] leverage domain knowledge to match entities, where, interestingly, they state: "real life data is typically dirty [thus] it is often necessary to hinge on the semantics of the data".

Some other works, more related to Information Retrieval and Natural Language Processing, focus on extracting coreferent entity names from unstructured text, tying in with aligning the results of Named Entity Recognition; for example, Singh et al. [60] present an approach to identify coreferences from a corpus of 3 million natural language "mentions" of persons, where they build compound "entities" out of the individual mentions.

9.2. Instance matching

With respect to RDF, one area of research also goes by the name instance matching: for example, in 2009, the Ontology Alignment Evaluation Initiative (http://oaei.ontologymatching.org/) introduced a new test track on instance matching (http://www.instancematching.org/). We refer the interested reader to the results of OAEI 2010 [20] for further information, where we now discuss some recent instance matching systems. It is worth noting that, in contrast to our scenario, many instance matching systems take as input a small number of instance sets (i.e., consistently named datasets) across which similarities and coreferences are computed. Thus, the instance matching systems mentioned in this section are not specifically tailored for processing Linked Data (and most do not present experiments along these lines).

The LN2R [56] system incorporates two methods for performing instance matching: (i) L2R applies deductive techniques, where rules are used to model the semantics of the respective ontology (particularly functional properties, inverse-functional properties and disjointness constraints), which are then used to infer consistent coreference matches; (ii) N2R is an inductive, numeric approach which uses the results of L2R to perform unsupervised matching, where a system of linear equations representing similarities is seeded using text-based analysis and then used to compute instance similarity by iterative resolution. Scalability is not directly addressed.

The MOMA [68] engine matches instances from two distinct datasets; to help ensure high precision, only members of the same class are matched (which can be determined, perhaps, by an ontology-matching phase). A suite of matching tools is provided by MOMA, along with the ability to pipeline matchers and to select a variety of operators for merging, composing and selecting (thresholding) results. Along these lines, the system is designed to facilitate a human expert in instance matching, focusing on a different scenario to ours presented herein.

The RiMOM [41] engine is designed for ontology matching, implementing a wide range of strategies which can be composed and combined in a graphical user interface; strategies can also be selected by the engine itself based on the nature of the input. Matching results from different strategies can be composed using linear interpolation. Instance matching techniques mainly rely on text-based analysis of resources, using Vector Space Models and IR-style measures of similarity and relevance. Large-scale instance matching is enabled by an inverted index over text terms, similar in principle to candidate reduction techniques discussed later in this section. They currently do not support symbolic techniques for matching (although they do allow disambiguation of instances from different classes).

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration), and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the aligning step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability, where the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.

Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.
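
As a minimal illustration of this kind of overlap measure (our reading of the idea, not Noessner et al.'s implementation), one can count the property-value assertions shared by two entity descriptions:

```python
# Count assertions (property, value pairs) that hold for both entities.
def overlap(desc_a, desc_b):
    return len(set(desc_a) & set(desc_b))

# invented toy descriptions
e1 = {("rdf:type", "foaf:Person"), ("foaf:name", "Alice"), ("foaf:based_near", "ex:Galway")}
e2 = {("rdf:type", "foaf:Person"), ("foaf:name", "Alice")}
print(overlap(e1, e2))  # 2 shared assertions
```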

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora. For example, Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38] (in fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d); Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality; no other OWL constructs or reasoning are used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities (from personal communication with the Sindice team). The related Sig.ma search system [69] internally uses IR-style, string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface (e.g., http://sig.ma/search?q=paris).

Bishop et al. [6] apply reasoning over 1.2 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.
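
To illustrate the canonicalisation idea referred to above and in the previous subsection (as opposed to exhaustively rewriting triples for every pair of equivalent terms), the following is a minimal in-memory sketch, assuming that the owl:sameAs pairs fit in memory and taking the lexicographically lowest identifier as the canonical term; the distributed implementations discussed in this paper instead rely on sorts and scans of on-disk data, so the sketch only conveys the core idea.

```python
# Canonicalisation of owl:sameAs equivalence classes via union-find:
# each identifier is mapped to one canonical term, and triples are rewritten
# once against that map (avoiding quadratic pairwise replacement).

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:          # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            if rb < ra:                        # keep lowest identifier as representative
                ra, rb = rb, ra
            self.parent[rb] = ra

def canonicalise(triples, same_as_pairs):
    """Rewrite subject/object terms to one canonical term per equivalence class."""
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    for s, p, o in triples:
        yield uf.find(s), p, uf.find(o)

# toy usage with invented identifiers
pairs = [("ex:A", "ex:B"), ("ex:B", "ex:C")]
data = [("ex:C", "foaf:name", '"Alice"'), ("ex:D", "foaf:knows", "ex:B")]
print(list(canonicalise(data, pairs)))
# [('ex:A', 'foaf:name', '"Alice"'), ('ex:D', 'foaf:knows', 'ex:A')]
```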

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
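
The following is a small toy sketch of this style of candidate reduction (our own illustration, not Song and Heflin's code): instances are blocked by their values for a chosen property, the property's discriminating power is estimated as the average number of instances sharing one of its values, and only instances that share a block become candidate pairs.

```python
# Candidate reduction by blocking on values of a "discriminating" property.
from collections import defaultdict
from itertools import combinations

def blocks_by_property(instances, prop):
    """Group instance IDs by their value(s) for prop."""
    index = defaultdict(set)
    for iid, desc in instances.items():
        for value in desc.get(prop, []):
            index[value].add(iid)
    return index

def discriminativeness(instances, prop):
    """Average number of instances sharing a value (lower = more discriminating)."""
    index = blocks_by_property(instances, prop)
    return sum(len(v) for v in index.values()) / max(len(index), 1)

def candidate_pairs(instances, prop):
    """Only pairs that share at least one value for prop become candidates."""
    pairs = set()
    for ids in blocks_by_property(instances, prop).values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# invented toy data: instance -> {property: [values]}
instances = {
    "ex:i1": {"foaf:mbox": ["mailto:a@ex.org"], "foaf:name": ["Alice"]},
    "ex:i2": {"foaf:mbox": ["mailto:a@ex.org"], "foaf:name": ["A. Smith"]},
    "ex:i3": {"foaf:mbox": ["mailto:b@ex.org"], "foaf:name": ["Alice"]},
}
print(discriminativeness(instances, "foaf:mbox"))   # 1.5
print(candidate_pairs(instances, "foaf:mbox"))      # {('ex:i1', 'ex:i2')}
```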

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into common regions of a space within which distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
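
As a rough illustration of this family of techniques (not RDFSim's actual index), the sketch below reduces each virtual document to a token set, derives banded MinHash signatures so that similar documents are likely to share a bucket, and then verifies bucketed pairs with the Jaccard coefficient; the band/row parameters and token sets are arbitrary choices for the example.

```python
# MinHash-based LSH over token sets, with Jaccard verification of candidates.
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash(tokens, num_hashes=32):
    # one "hash function" per index i, simulated by salting md5 with i
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for i in range(num_hashes)]

def jaccard(a, b):
    return len(a & b) / len(a | b)

def lsh_candidates(docs, num_hashes=32, bands=16):
    rows = num_hashes // bands
    sigs = {d: minhash(toks, num_hashes) for d, toks in docs.items()}
    buckets = defaultdict(set)
    for d, sig in sigs.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(d)
    pairs = set()
    for ds in buckets.values():
        pairs.update(combinations(sorted(ds), 2))
    return pairs

# invented "virtual documents" as token sets
docs = {
    "ex:e1": {"tim", "berners-lee", "w3c", "director"},
    "ex:e2": {"tim", "berners-lee", "mit", "director"},
    "ex:e3": {"paris", "france", "capital"},
}
for a, b in lsh_candidates(docs):
    print(a, b, round(jaccard(docs[a], docs[b]), 2))
```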

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage the use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22] (http://www.rkbexplorer.com/sameAs/), <sameAs> (http://sameas.org/) and ObjectCoref [15] (http://ws.nju.edu.cn/objectcoref/) offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging (again, our inspection results are available at http://swse.deri.org/entity).

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was five thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.
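
As a rough sketch of how such deployment statistics can be computed from a set of owl:sameAs statements (our illustration, not the tooling of Ding et al.), equivalence classes can be taken as the connected components of the undirected owl:sameAs graph, from which class counts and sizes follow directly:

```python
# Equivalence classes as connected components of the owl:sameAs graph,
# computed by breadth-first search over an adjacency map.
from collections import defaultdict, deque

def equivalence_classes(same_as_pairs):
    adj = defaultdict(set)
    for a, b in same_as_pairs:
        adj[a].add(b)
        adj[b].add(a)
    seen, classes = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = set(), deque([node])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(adj[n] - comp)
        seen |= comp
        classes.append(comp)
    return classes

# invented owl:sameAs pairs
pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:x", "ex:y")]
sizes = [len(c) for c in equivalence_classes(pairs)]
print(len(sizes), sum(sizes) / len(sizes), max(sizes))  # 2 classes, avg 2.5, max 3
```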

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Líon-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box" prefixes
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb#
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box" prefixes
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/.
[4] Tim Berners-Lee, Linked Data. Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/.
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance matching for ontology population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of the Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: a knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/.
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs isn't the same: an analysis of identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the same: an analysis of identity links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/.
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing object consolidation on the Semantic Web data graph, in: First I3 Workshop: Identity, Identifiers, Identification, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable authoritative OWL reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: a dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: a practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic linkage of vital records. Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of semantically annotated data by the KnoFuss architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking music-related data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A theory of diagnosis from first principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The case for shared nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.
[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

• Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora
  • 1 Introduction
  • 2 Preliminaries
    • 2.1 RDF
    • 2.2 RDFS, OWL and owl:sameAs
    • 2.3 Inference rules
    • 2.4 T-split rules
    • 2.5 Authoritative T-split rules
    • 2.6 OWL 2 RL/RDF rules
    • 2.7 Distribution architecture
    • 2.8 Experimental setup
  • 3 Experimental corpus
  • 4 Base-line consolidation
    • 4.1 High-level approach
    • 4.2 Distributed approach
    • 4.3 Performance evaluation
    • 4.4 Results evaluation
  • 5 Extended reasoning consolidation
    • 5.1 High-level approach
    • 5.2 Distributed approach
    • 5.3 Performance evaluation
    • 5.4 Results evaluation
  • 6 Statistical concurrence analysis
    • 6.1 High-level approach
      • 6.1.1 Quantifying concurrence
      • 6.1.2 Implementing entity-concurrence analysis
    • 6.2 Distributed implementation
    • 6.3 Performance evaluation
    • 6.4 Results evaluation
  • 7 Entity disambiguation and repair
    • 7.1 High-level approach
    • 7.2 Implementing disambiguation
    • 7.3 Distributed implementation
    • 7.4 Performance evaluation
    • 7.5 Results evaluation
  • 8 Critical discussion
  • 9 Related work
    • 9.1 Related fields
    • 9.2 Instance matching
    • 9.3 Domain-specific consolidation
    • 9.4 Large-scale Web-data consolidation
    • 9.5 Large-scale/distributed consolidation
    • 9.6 Candidate reduction
    • 9.7 Naming/linking resources on the Web
    • 9.8 Analyses of owl:sameAs
  • 10 Conclusion
  • Acknowledgements
  • Appendix A Prefixes
  • References
The LN2R [56] system incorporates two methods for performinginstance matching (i) L2R applies deductive techniques whererules are used to model the semantics of the respective ontol-ogymdashparticularly functional properties inverse-functional proper-ties and disjointness constraintsmdashwhich are then used to inferconsistent coreference matches (ii) N2R is an inductive numericapproach which uses the results of L2R to perform unsupervisedmatching where a system of linear-equations representing simi-larities is seeded using text-based analysis and then used to com-pute instance similarity by iterative resolution Scalability is notdirectly addressed

The MOMA [68] engine matches instances from two distinctdatasets to help ensure high precision only members of the sameclass are matched (which can be determined perhaps by an ontol-ogy-matching phase) A suite of matching tools is provided byMOMA along with the ability to pipeline matchers and to selecta variety of operators for merging composing and selecting (thres-holding) results Along these lines the system is designed to facil-itate a human expert in instance-matching focusing on a differentscenario to ours presented herein

The RiMoM [41] engine is designed for ontology matchingimplementing a wide range of strategies which can be composedand combined in a graphical user interface strategies can also beselected by the engine itself based on the nature of the inputMatching results from different strategies can be composed usinglinear interpolation Instance matching techniques mainly rely ontext-based analysis of resources using Vector Space Models andIR-style measures of similarity and relevance Large scale instancematching is enabled by an inverted-index over text terms similarin principle to candidate reduction techniques discussed later inthis section They currently do not support symbolic techniquesfor matching (although they do allow disambiguation of instancesfrom different classes)

Nikolov et al. [48] present the KnoFuss architecture for aligning data on an assertional level; they identify a three-phase process involving coreferencing (finding equivalent individuals), conflict detection (finding inconsistencies caused by the integration) and inconsistency resolution. For the coreferencing, the authors introduce and discuss approaches incorporating string similarity measures and class-based machine learning techniques. Although the high-level process is similar to our own, the authors do not address scalability concerns.

In motivating the design of their RDF-AI system, Scharffe et al. [58] identify four steps in aligning datasets: align, interlink, fuse and post-process. The align process identifies equivalences between entities in the two datasets; the interlink process materialises owl:sameAs relations between the two datasets; the fusing step merges the two datasets (on both a terminological and assertional level, possibly using domain-specific rules); and the post-processing phase subsequently checks the consistency of the output data. Although parts of this conceptual process echo our own, the RDF-AI system itself differs greatly from our work. First, the authors focus on the task of identifying coreference across two distinct datasets, whereas we aim to identify coreference in a large "bag of instances". Second, the authors do not emphasise scalability: the RDF datasets are loaded into memory and processed using the popular Jena framework; similarly, the system proposes performing pair-wise comparison, e.g., using string matching techniques (which we do not support). Third, although inconsistency detection is mentioned in the conceptual process, the authors do not discuss implementation details for the RDF-AI system itself.


Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65], and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis: they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency-preserving) one-to-one functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours: in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38]50; Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50. In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.
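For illustration, the following minimal sketch (in Python; the tuple representation and names are ours, not those of [34] or [36,35]) computes the average cardinality and average inverse-cardinality of each property over a collection of (subject, predicate, object) tuples; properties whose averages are close to one behave like (inverse-)functional properties and are thus more discriminative for matching.

from collections import defaultdict

def average_cardinalities(triples):
    # objs[p][s]: set of objects for subject s under property p;
    # subjs[p][o]: set of subjects for object o under property p.
    objs = defaultdict(lambda: defaultdict(set))
    subjs = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        objs[p][s].add(o)
        subjs[p][o].add(s)
    stats = {}
    for p in objs:
        card = sum(len(v) for v in objs[p].values()) / len(objs[p])
        inv_card = sum(len(v) for v in subjs[p].values()) / len(subjs[p])
        stats[p] = (card, inv_card)
    return stats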

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data, and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

Bishop et al. [6] apply reasoning over 12 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations, similar to our canonicalisation, as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
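The canonicalisation optimisation mentioned above can be sketched as follows (a minimal in-memory Python illustration, not the on-disk, distributed sort/scan implementation used in our system or in BigOWLIM): equivalence classes are built with a union-find structure over the owl:sameAs pairs, one canonical identifier is chosen per class, and every triple is rewritten once, rather than materialising the quadratic set of owl:sameAs rewritings.

class UnionFind:
    # Minimal union-find over identifiers.
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            # keep the lexically lowest identifier as the canonical one
            self.parent[max(ra, rb)] = min(ra, rb)

def canonicalise(triples, same_as_pairs):
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    # one linear rewrite of the data to canonical identifiers
    for s, p, o in triples:
        yield (uf.find(s), p, uf.find(o))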

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers.

51. From personal communication with the Sindice team.
52. http://sig.ma/search?q=paris


Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
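The blocking idea underlying such candidate selection can be illustrated with the following minimal sketch (Python; the parameter values and names are illustrative, not those of Song and Heflin's system): instances sharing a value for a sufficiently selective (property, value) key are grouped into small blocks, and pair-wise matching is only performed inside each block.

from collections import defaultdict
from itertools import combinations

def candidate_pairs(triples, max_block_size=20):
    blocks = defaultdict(set)
    for s, p, o in triples:
        blocks[(p, o)].add(s)
    pairs = set()
    for members in blocks.values():
        # discard keys that select nothing (singletons) or too much (over-popular values)
        if 1 < len(members) <= max_block_size:
            for a, b in combinations(sorted(members), 2):
                pairs.add((a, b))
    return pairs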

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources: a set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
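A minimal MinHash-based variant of this idea is sketched below (Python; the hashing scheme and parameters are illustrative and do not reproduce RDFSim's actual index): resources whose token sets are similar under the Jaccard coefficient tend to collide in at least one band of their MinHash signatures, and only such bucket-mates need be compared further.

import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=64):
    # one minimum per seeded hash function; similar token sets agree in many positions
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def lsh_buckets(docs, num_hashes=64, bands=16):
    # docs: {resource: set of tokens from its virtual document}
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for res, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(res)
    return [members for members in buckets.values() if len(members) > 1]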

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets;

53. http://www.rkbexplorer.com/sameAs/
54. http://sameas.org/
55. http://ws.nju.edu.cn/objectcoref/
56. Again, our inspection results are available at http://swse.deri.org/entity

in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "dead link") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We


have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Table A1: Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec/
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/


Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/

[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second international workshop on Information quality in information systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


9 Related work

91 Related fields

Work relating to entity consolidation has been researched in thearea of databases for a number of years aiming to identify and pro-cess co-referent signifiers with works under the titles of recordlinkage record or data fusion merge-purge instance fusion andduplicate identification and (ironically) a plethora of variationsthereupon see [474413512] etc and surveys at [198] Unlikeour approach which leverages the declarative semantics of thedata in order to be domain agnostic such systems usually operategiven closed schemasmdashsimilarly they typically focus on string-similarity measures and statistical analysis Haas et al note thatlsquolsquoin relational systems where the data model does not provide primi-tives for making same-as assertions lt gt there is a value-based notionof identityrsquorsquo [25] However we note that some works have focusedon leveraging semantics for such tasks in relation databases egFan et al [21] leverage domain knowledge to match entities whereinterestingly they state lsquolsquoreal life data is typically dirty ltthusgt it isoften necessary to hinge on the semantics of the datarsquorsquo

Some other worksmdashmore related to Information Retrieval andNatural Language Processingmdashfocus on extracting coreferent entitynames from unstructured text tying in with aligning the results ofNamed Entity Recognition where for example Singh et al [60]present an approach to identify coreferences from a corpus of 3million natural language lsquolsquomentionsrsquorsquo of persons where they buildcompound lsquolsquoentitiesrsquorsquo out of the individual mentions

92 Instance matching

With respect to RDF one area of research also goes by the nameinstance matching for example in 2009 the Ontology AlignmentEvaluation Initiative48 introduced a new test track on instancematching49 We refer the interested reader to the results of OAEI2010 [20] for further information where we now discuss some re-cent instance matching systems It is worth noting that in contrastto our scenario many instance matching systems take as input asmall number of instance setsmdashie consistently named datasetsmdash

47 Although it must be said we currently do not track the steps used to derive theequivalences involved in consolidation which would be expensive to materialise andmaintain

48 OAEI httpoaeiontologymatchingorg49 Instance data matching httpwwwinstancematchingorg

across which similarities and coreferences are computed Thus theinstance matching systems mentioned in this section are not specif-ically tailored for processing Linked Data (and most do not presentexperiments along these lines)

The LN2R [56] system incorporates two methods for performinginstance matching (i) L2R applies deductive techniques whererules are used to model the semantics of the respective ontol-ogymdashparticularly functional properties inverse-functional proper-ties and disjointness constraintsmdashwhich are then used to inferconsistent coreference matches (ii) N2R is an inductive numericapproach which uses the results of L2R to perform unsupervisedmatching where a system of linear-equations representing simi-larities is seeded using text-based analysis and then used to com-pute instance similarity by iterative resolution Scalability is notdirectly addressed

The MOMA [68] engine matches instances from two distinctdatasets to help ensure high precision only members of the sameclass are matched (which can be determined perhaps by an ontol-ogy-matching phase) A suite of matching tools is provided byMOMA along with the ability to pipeline matchers and to selecta variety of operators for merging composing and selecting (thres-holding) results Along these lines the system is designed to facil-itate a human expert in instance-matching focusing on a differentscenario to ours presented herein

The RiMoM [41] engine is designed for ontology matchingimplementing a wide range of strategies which can be composedand combined in a graphical user interface strategies can also beselected by the engine itself based on the nature of the inputMatching results from different strategies can be composed usinglinear interpolation Instance matching techniques mainly rely ontext-based analysis of resources using Vector Space Models andIR-style measures of similarity and relevance Large scale instancematching is enabled by an inverted-index over text terms similarin principle to candidate reduction techniques discussed later inthis section They currently do not support symbolic techniquesfor matching (although they do allow disambiguation of instancesfrom different classes)

Nikolov et al [48] present the KnoFuss architecture for aligningdata on an assertional level they identify a three phase processinvolving coreferencing (finding equivalent individuals) conflictdetection (finding inconsistencies caused by the integration) andinconsistency resolution For the coreferencing the authors intro-duce and discuss approaches incorporating string similarity mea-sures and class-based machine learning techniques Although thehigh-level process is similar to our own the authors do not addressscalability concerns

In motivating the design of their RDF-AI system Scharffe et al[58] identify four steps in aligning datasets align interlink fuseand post-process The align process identifies equivalences betweenentities in the two datasets the interlink process materialises owlsameAs relations between the two datasets the aligning stepmerges the two datasets (on both a terminological and assertionallevel possibly using domain-specific rules) and the post-processingphase subsequently checks the consistency of the output dataAlthough parts of this conceptual process echoes our own theRDF-AI system itself differs greatly from our work First the authorsfocus on the task of identifying coreference across two distinct data-sets whereas we aim to identify coreference in a large lsquolsquobag of in-stancesrsquorsquo Second the authors do not emphasise scalability wherethe RDF datasets are loaded into memory and processed using thepopular Jena framework similarly the system proposes performingusing pair-wise comparison for eg using string matching tech-niques (which we do not support) Third although inconsistencydetection is mentioned in the conceptual process the authors donot discuss implementation details for the RDF-AI system itself

106 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

Noessner et al [49] present an approach for aligning two A-Boxes described using the same T-Box in particular they leveragesimilarity measures introduced by Stuckenschmidt [65] and definean optimisation problem to identify the alignment that generatesthe highest weighted similarity between the two A-Boxes underanalysis they use Integer Linear Programming to generate theoptimal alignment encoding linear constraints to enforce valid(ie consistency preserving) one-to-one functional mappingsAlthough they give performance results they do not directly ad-dress scalability Their method for comparing entities is similarin practice to ours they measure the lsquolsquooverlapping knowledgersquorsquo be-tween two entities counting how many assertions are true aboutboth The goal is to match entities such that the resulting consol-idation is consistent the measure of overlap is maximal

Like us Castano et al [12] approach instance matching fromtwo distinct perspectives (i) determine coreferent identifiers (ii)detect similar individuals based on the data they share Much oftheir work is similar in principle to ours in particular they use rea-soning for identifying equivalences and use a statistical approachfor identifying properties lsquolsquowith high identification powerrsquorsquo Theydo not consider use of inconsistency detection for disambiguatingentities and perhaps more critically only evaluate with respect toa dataset containing 15 thousand entities

Cudreacute-Mauroux et al [17] present the idMesh system whichleverages user-defined associations and probabalistic methods toderive entity-level relationships including resolution of conflictsthey also delineate entities based on lsquolsquotemporal discriminationrsquorsquowhereby coreferent entities may predate or postdate one anothercapturing a description thereof at a particular point in time The id-Mesh system itself is designed over a peer-to-peer network withcentralised coordination However evaluation is over syntheticdata where they only demonstrate a maximum scale involving8000 entities and 24000 links over 400 machines the evaluationof performance focuses on network traffic and message exchangeas opposed to time

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system, called ObjectCoRef, for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel), built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.
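To make the style of reasoning behind such a kernel concrete, the following minimal sketch (our own naming and illustrative data, not the ObjectCoRef implementation) derives owl:sameAs pairs by grouping subjects that share a value on a declared inverse-functional property; functional properties can be handled symmetrically by grouping on subject and property.

```python
from collections import defaultdict
from itertools import combinations

def same_as_from_ifps(quads, ifps):
    """quads: iterable of (s, p, o, ctx); ifps: set of inverse-functional property URIs.
    Returns owl:sameAs pairs for subjects sharing a value on an IFP."""
    groups = defaultdict(set)               # (property, value) -> subjects
    for s, p, o, _ in quads:
        if p in ifps:
            groups[(p, o)].add(s)
    pairs = set()
    for subjects in groups.values():
        for s1, s2 in combinations(sorted(subjects), 2):
            pairs.add((s1, s2))             # owl:sameAs s1 s2 (closed transitively later)
    return pairs

# Hypothetical example data:
IFPS = {"http://xmlns.com/foaf/0.1/mbox"}
quads = [("ex:a", "http://xmlns.com/foaf/0.1/mbox", "mailto:x@example.org", "g1"),
         ("ex:b", "http://xmlns.com/foaf/0.1/mbox", "mailto:x@example.org", "g2")]
print(same_as_from_ifps(quads, IFPS))       # {('ex:a', 'ex:b')}
```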

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mash up results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 1.2 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
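The canonicalisation optimisation referred to here can be sketched as follows (a generic illustration under our own naming, not BigOWLIM's internals): each identifier is mapped to a single canonical representative, e.g., via a union-find structure, and statements are rewritten against that representative, rather than materialising the quadratic set of pairwise owl:sameAs rewritings within each equivalence class.

```python
class UnionFind:
    """Maps each identifier to a canonical representative (with path compression)."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path compression
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # pick, e.g., the lexicographically lowest identifier as canonical
            self.parent[max(rx, ry)] = min(rx, ry)

def canonicalise(quads, same_as_pairs):
    """Rewrite subjects/objects to canonical identifiers instead of emitting O(n^2) sameAs triples."""
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    return [(uf.find(s), p, uf.find(o), c) for s, p, o, c in quads]
```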

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable distributed rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
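As a rough illustration of the first technique (our own sketch and thresholds, not the exact measure or implementation of [62]): a property's discriminability can be estimated from how many instances typically share one of its values, and instances can then be blocked into small candidate sets by their values on highly discriminating properties.

```python
from collections import defaultdict

def discriminability(triples):
    """triples: (subject, property, value). A score near 1 means a value is rarely shared."""
    subjects_per_value = defaultdict(set)
    for s, p, v in triples:
        subjects_per_value[(p, v)].add(s)
    per_prop = defaultdict(list)
    for (p, _), subs in subjects_per_value.items():
        per_prop[p].append(len(subs))
    return {p: 1.0 / (sum(counts) / len(counts)) for p, counts in per_prop.items()}

def candidate_sets(triples, min_score=0.5):
    """Group instances sharing a value on a discriminating property; only these small
    blocks need be compared pair-wise."""
    scores = discriminability(triples)
    blocks = defaultdict(set)
    for s, p, v in triples:
        if scores.get(p, 0.0) >= min_score:
            blocks[(p, v)].add(s)
    return [b for b in blocks.values() if len(b) > 1]
```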

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
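The following minimal sketch (ours, not the RDFSim code) illustrates the LSH idea described above: MinHash signatures over the token sets of the virtual documents are split into bands so that resources with high Jaccard overlap tend to collide in the same bucket, and only the resulting small clusters are compared further.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=64):
    """One MinHash value per (seeded) hash function over a non-empty token set."""
    return [min(int(hashlib.md5(f"{i}:{t}".encode()).hexdigest(), 16) for t in tokens)
            for i in range(num_hashes)]

def lsh_buckets(docs, num_hashes=64, bands=16):
    """docs: {resource: set of tokens from its 'virtual document'}.
    Returns candidate clusters of resources whose signatures collide in some band."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for res, tokens in docs.items():
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(res)
    return [c for c in buckets.values() if len(c) > 1]

def jaccard(a, b):
    """Exact Jaccard coefficient, applied only within candidate clusters."""
    return len(a & b) / len(a | b)
```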

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>,54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers; this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regards the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

56 Again, our inspection results are available at http://swse.deri.org/entity

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was five thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

("T-Box prefixes")
atomowl       http://bblfish.net/work/atom-owl/2006-06-06/
b2r           http://bio2rdf.org/bio2rdf
b2rr          http://bio2rdf.org/bio2rdf_resource
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb/
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

("A-Box prefixes")
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). <http://www.w3.org/TeamSubmission/turtle/>.
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). <http://www.w3.org/TR/rdf-schema/>.
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). <http://www.w3.org/TR/owl2-profiles/>.
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). <http://www.w3.org/TR/rdf-mt/>.
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceeding of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). <http://www.w3.org/TR/rdf-sparql-query/>.
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.
[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.

Nikolov et al [48] present the KnoFuss architecture for aligningdata on an assertional level they identify a three phase processinvolving coreferencing (finding equivalent individuals) conflictdetection (finding inconsistencies caused by the integration) andinconsistency resolution For the coreferencing the authors intro-duce and discuss approaches incorporating string similarity mea-sures and class-based machine learning techniques Although thehigh-level process is similar to our own the authors do not addressscalability concerns

In motivating the design of their RDF-AI system Scharffe et al[58] identify four steps in aligning datasets align interlink fuseand post-process The align process identifies equivalences betweenentities in the two datasets the interlink process materialises owlsameAs relations between the two datasets the aligning stepmerges the two datasets (on both a terminological and assertionallevel possibly using domain-specific rules) and the post-processingphase subsequently checks the consistency of the output dataAlthough parts of this conceptual process echoes our own theRDF-AI system itself differs greatly from our work First the authorsfocus on the task of identifying coreference across two distinct data-sets whereas we aim to identify coreference in a large lsquolsquobag of in-stancesrsquorsquo Second the authors do not emphasise scalability wherethe RDF datasets are loaded into memory and processed using thepopular Jena framework similarly the system proposes performingusing pair-wise comparison for eg using string matching tech-niques (which we do not support) Third although inconsistencydetection is mentioned in the conceptual process the authors donot discuss implementation details for the RDF-AI system itself

106 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

Noessner et al [49] present an approach for aligning two A-Boxes described using the same T-Box in particular they leveragesimilarity measures introduced by Stuckenschmidt [65] and definean optimisation problem to identify the alignment that generatesthe highest weighted similarity between the two A-Boxes underanalysis they use Integer Linear Programming to generate theoptimal alignment encoding linear constraints to enforce valid(ie consistency preserving) one-to-one functional mappingsAlthough they give performance results they do not directly ad-dress scalability Their method for comparing entities is similarin practice to ours they measure the lsquolsquooverlapping knowledgersquorsquo be-tween two entities counting how many assertions are true aboutboth The goal is to match entities such that the resulting consol-idation is consistent the measure of overlap is maximal

Like us Castano et al [12] approach instance matching fromtwo distinct perspectives (i) determine coreferent identifiers (ii)detect similar individuals based on the data they share Much oftheir work is similar in principle to ours in particular they use rea-soning for identifying equivalences and use a statistical approachfor identifying properties lsquolsquowith high identification powerrsquorsquo Theydo not consider use of inconsistency detection for disambiguatingentities and perhaps more critically only evaluate with respect toa dataset containing 15 thousand entities

Cudreacute-Mauroux et al [17] present the idMesh system whichleverages user-defined associations and probabalistic methods toderive entity-level relationships including resolution of conflictsthey also delineate entities based on lsquolsquotemporal discriminationrsquorsquowhereby coreferent entities may predate or postdate one anothercapturing a description thereof at a particular point in time The id-Mesh system itself is designed over a peer-to-peer network withcentralised coordination However evaluation is over syntheticdata where they only demonstrate a maximum scale involving8000 entities and 24000 links over 400 machines the evaluationof performance focuses on network traffic and message exchangeas opposed to time

93 Domain-specific consolidation

Various authors have looked at applying consolidation over do-main-specific RDF corpora eg Sleeman and Finin look at usingmachine learning techniques to consolidate FOAF personal profileinformation [61] Shi et al similarly look at FOAF-specific align-ment techniques [59] using inverse-functional properties and fuz-zy string matching Jentzsch et al examine alignment of publisheddrug data [38]50 Raimond et al look at interlinking RDF from themusic-domain [54] Monaghan and Orsquo Sullivan apply consolidationto photo annotations expressed in RDF [46]

Salvadores et al [57] present the LinksB2N system which aimsto perform scalable integration of RDF data particularly focusingon evaluation over corpora from the marketing domain howevertheir methods are not specific to this domain They do not leveragethe semantics of the data for performing consolidation insteadusing similarity measures based on the idea that lsquolsquothe unique com-bination of RDF predicates associated with RDF resources is whatdefines their existence as uniquersquorsquo [57] This is a similar intuitionto that behind our concurrence analysis but we again questionthe validity of such an assumption for consolidation particularlygiven incomplete data and the Open World Assumption underlyingRDF(S)OWLmdashwe view an RDF resource as a description of some-thing signified and would wish to avoid conflating unique signifi-ers even if they match precisely with respect to their description

50 In fact we believe that this work generates the incorrect results observable inTable 6 cf httpgroupsgooglecomgrouppedantic-webbrowse_threadthreadad740f7052cc3a2d

94 Large-scaleWeb-data consolidation

Like us various authors have investigated consolidation tech-niques tailored to the challengesmdashesp scalability and noisemdashofresolving coreference in heterogeneous RDF Web data

On a larger scale and following a similar approach to us Huet al [3635] have recently investigated a consolidation systemmdashcalled ObjectCoRefmdashfor application over 500 million triples ofWeb data They too define a baseline consolidation (which they callthe kernel) built by applying reasoning on owlsameAsowlInverseFunctionalProperty owlFunctionalProper-

ty and owlcardinality No other OWL constructs or reasoningis used when building the kernel Then starting from the corefer-ents in the kernel they use a learning technique to find the mostdiscriminative properties Further they reuse the notion of average(inverse) cardinality [34] to find frequent property combinationpairs A small scale experiment is conducted that shows good re-sults At large scale although they manage to build the kernel forthe 500 million triples dataset they only apply and evaluate theirstatistical method on a very small subset confessing that the ap-proach would not be feasible for the complete set

The Sindice system has historically used inverse-functional prop-erties to find equivalent identifiers also investigating some bespokelsquolsquoschema-levelrsquorsquo reasoning to identify a wider range of such proper-ties [50] However Sindice no longer uses such OWL features fordetermining coreference of entities51 The related Sigma search sys-tem [69] internally uses IR-style string-based measures (eg TF-IDFscores) to mashup results in their engine compared to our systemthey do not prioritise precision where mashups are designed to havea high recall of related data and to be disambiguated manually by theuser using the interface This can be verified anecdotally by searchingfor various ambiguous terms in their public interface52

Bishop et al [6] apply reasoning over 12 billion Linked Data tri-ples using the BigOWLIM reasoner however this dataset is manu-ally selected as a merge of a number of smaller known datasets asopposed to an arbitrary corpus They discuss well-known optimisa-tions similar to our canonicalisation as a necessary means of avoid-ing the quadratic nature of traditional replacement semantics forowlsameAs

95 Large-scaledistributed consolidation

With respect to distributed consolidation Urbani et al [70] pro-pose the WebPie system which uses MapReduce to perform pDreasoning [67] using a cluster of commodity hardware similar toourselves The pD ruleset contains rules for handling owl sameAsreplacement inverse-functional properties and functional-proper-ties but not for cardinalities (which in any case we demonstratedto be ineffective over out corpus) The authors also discuss a canon-icalisation approach for handling equivalent identifiers They dem-onstrate their methods over 100 billion triples of synthetic LUBMdata over 64 machines however they do not present evaluationover Linked Data do not perform any form of similarity or concur-rence measures do not consider inconsistency detection (not givenby the pD ruleset or by their corpora) and generally have asomewhat different focus scalable distributed rule-basedmaterialisation

96 Candidate reduction

For performing large-scale consolidation one approach is togroup together sets of candidates within which there are likely

1 From personal communication with the Sindice team2 httpsigmasearchq=paris

5

5

A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110 107

to be coreferent identifiers Such candidate reduction then by-passes the need for complete pair-wise comparison and can enablethe use of more expensive matching methods in closed regionsThis is particularly relevant for text-based analyses which we havenot tackled in this paper However the principle of candidatereduction generalises to any form of matching that cannot be per-formed in a complete pair-wise manner

Song and Heflin [62] investigate scalable methods to prune thecandidate-space for the instance matching task focusing on sce-narios such as our own First discriminating propertiesmdashie thosefor which a typical value is shared by few instancesmdashare used togroup initial candidate sets Second the authors propose buildingtextual descriptions of instances from surrounding text and usingan IR-style inverted index to query and group lexically similar in-stances Evaluation demonstrates feasibility for matching up-toone million instances although the authors claim that the ap-proach can scale further

Ioannou et al [37] also propose a system called RDFSim tocluster similar candidates based on associated textual informationVirtual prose documents are constructed for resources from sur-rounding text which are then encoded in a special index whichusing Locality Sensitive Hashing (LSH) The core idea of LSH is tohash similar virtual documents into regions of space where dis-tance measures can be used to identify similar resources A set ofhash functions can be employed for these purposes and lsquolsquoclosersquorsquodocuments are then subject to other similarity measures the RDF-Sim system currently uses the Jaccard coefficient to compute tex-tual similarities for these purposes

97 Naminglinking resources on the Web

A number of works have investigated the fundamentals of re-source naming and linking on the Web proposing schemes ormethods to bootstrap agreement between remote publishers oncommon resource URIs or to enable publishers to better link theircontribution with other external datasets

With respect to URI naming on the Web Bouquet et al [10] pro-pose OKKAM a centralised naming architecture for minting URI sig-nifiers for the Web The OKKAM system thus supports entity lookupsfor publishers where a search providing a description of the entityin question returns a set of possible candidates Publishers canthen re-use the URI for the candidate which most closely matchestheir intention in their local data with the intuition that other pub-lishers have done the same thus all users of the system will havetheir naming schemes aligned However we see such a centralisedlsquolsquolsquonaming authorityrsquorsquo as going against the ad-hoc decentralisedscale-free nature of the Web More concretely for example havingone URI per resource goes against current Linked Data principleswhich encourage use of dereferenceable URIs one URI can onlydereference to one document so which document should that beand who should decide

Online systems RKBExplorer [2322]53 ltsameAsgt54 and Object-Coref [15]55 offer on-demand querying for owlsameAs relationsfound for a given input URI which they internally compute andstore the former focus on publishing owlsameAs relations forauthors and papers in the area of scientific publishing with the lattertwo systems offering more general owlsameAs relationships be-tween Linked Data identifiers In fact many of the owlsameAs rela-tions we consume are published as Linked Data by the RKBExplorersystem

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.
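
To make the idea of a domain-specific linkage rule concrete, the following minimal sketch (written in Python rather than Silk's own rule language, with property names, thresholds and datasets invented for the example) accepts a pair of resources when their normalised labels match and their coordinates nearly coincide, and emits the corresponding owl:sameAs statements:

def normalise(label):
    return " ".join(label.lower().split())

def link_rule(a, b, max_coord_delta=0.01):
    # Hypothetical domain-specific rule: same normalised label and
    # near-identical coordinates are taken to imply owl:sameAs.
    if normalise(a["label"]) != normalise(b["label"]):
        return False
    return (abs(a["lat"] - b["lat"]) <= max_coord_delta and
            abs(a["long"] - b["long"]) <= max_coord_delta)

def emit_links(source, target):
    # Yield owl:sameAs statements (N-Triples) for accepted pairs.
    for s_uri, s in source.items():
        for t_uri, t in target.items():
            if link_rule(s, t):
                yield f"<{s_uri}> <http://www.w3.org/2002/07/owl#sameAs> <{t_uri}> ."

source = {"http://example.org/a/Galway": {"label": "Galway", "lat": 53.27, "long": -9.05}}
target = {"http://example.org/b/galway-city": {"label": " galway ", "lat": 53.272, "long": -9.049}}
for triple in emit_links(source, target):
    print(triple)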

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging (again, our inspection results are available at http://swse.deri.org/entity/).
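
A sampling scheme of this kind can be sketched as follows (our own minimal reading of "logarithmic weights for each domain", not the authors' code; the data layout is hypothetical): each domain is chosen with probability proportional to the logarithm of its owl:sameAs count, dampening the influence of very large exporters, and a link is then drawn from the chosen domain.

import math
import random

def sample_links(links_by_domain, k, seed=42):
    # Each domain is weighted by log(1 + its owl:sameAs count) so that a
    # handful of very large exporters do not dominate the sample.
    rnd = random.Random(seed)
    domains = list(links_by_domain)
    weights = [math.log(1 + len(links_by_domain[d])) for d in domains]
    sample = []
    for _ in range(k):
        domain = rnd.choices(domains, weights=weights, k=1)[0]
        sample.append(rnd.choice(links_by_domain[domain]))
    return sample

# Hypothetical input: domain -> list of (subject, object) owl:sameAs pairs.
links_by_domain = {
    "dbpedia.org":  [("db:Paris", "fb:Paris")] * 1000,
    "geonames.org": [("gn:2988507", "db:Paris")] * 10,
}
print(sample_links(links_by_domain, 5))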

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found contained five thousand identifiers (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.
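
Aggregate statistics of this kind (identifiers per equivalence class, largest class) can be derived from raw owl:sameAs pairs with a single union-find pass, as in the following minimal sketch (our own illustration, not the measurement code of either paper):

from collections import Counter

def find(parent, x):
    # Path-halving find; roots map to themselves.
    while parent.setdefault(x, x) != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def class_sizes(same_as_pairs):
    # Union-find over owl:sameAs pairs; returns the size of each
    # equivalence class keyed by its canonical (root) identifier.
    parent = {}
    for a, b in same_as_pairs:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb
    return Counter(find(parent, x) for x in parent)

pairs = [("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:d", "ex:e")]
sizes = class_sizes(pairs)
print(max(sizes.values()))               # largest equivalence class: 3
print(sum(sizes.values()) / len(sizes))  # average identifiers per class: 2.5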

10. Conclusion

In this paper, we have provided a comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper.

Prefix        URI

"T-Box prefixes"
atomowl       http://bblfish.net/work/atom-owl/2006-06-06#
b2r           http://bio2rdf.org/bio2rdf:
b2rr          http://bio2rdf.org/bio2rdf_resource:
dbo           http://dbpedia.org/ontology/
dbp           http://dbpedia.org/property/
ecs           http://rdf.ecs.soton.ac.uk/ontology/ecs#
eurostat      http://ontologycentral.com/2009/01/eurostat/ns#
fb            http://rdf.freebase.com/ns/
foaf          http://xmlns.com/foaf/0.1/
geonames      http://www.geonames.org/ontology#
kwa           http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed     http://linkedlifedata.com/resource/pubmed/
loc           http://sw.deri.org/2006/07/location/loc#
mo            http://purl.org/ontology/mo/
mvcb          http://webns.net/mvcb#
opiumfield    http://rdf.opiumfield.com/lastfm/spec#
owl           http://www.w3.org/2002/07/owl#
quaffing      http://purl.org/net/schemas/quaffing/
rdf           http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs          http://www.w3.org/2000/01/rdf-schema#
skipinions    http://skipforward.net/skipforward/page/seeder/skipinions/
skos          http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl       http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia       http://dbpedia.org/resource/
dblpperson    http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p     http://www.eswc2006.org/people/
identicau     http://identi.ca/user/
kingdoms      http://lod.geospecies.org/kingdoms/
macs          http://stitch.cs.vu.nl/alignments/macs/
semweborg     http://data.semanticweb.org/organization/
swid          http://semanticweb.org/id/
timblfoaf     http://www.w3.org/People/Berners-Lee/card#
vperson       http://virtuoso.openlinksw.com/dataspace/person/
wikier        http://www.wikier.org/foaf.rdf#
yagor         http://www.mpii.de/yago/resource/
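
As used throughout the paper, a prefixed name abbreviates the URI formed by concatenating the prefix's namespace and the local name; a minimal sketch (the map below is a small subset of Table A1):

PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "dbpedia": "http://dbpedia.org/resource/",
}

def expand(prefixed_name):
    # Expand a prefixed name such as foaf:knows to its full URI.
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("owl:sameAs"))  # http://www.w3.org/2002/07/owl#sameAs
print(expand("foaf:knows"))  # http://xmlns.com/foaf/0.1/knows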

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/.
[4] Tim Berners-Lee, Linked Data: Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from: <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from: <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/.
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/.
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/.
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from: <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from: <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from: <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.
[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.



[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649

110 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

[72] Julius Volz Christian Bizer Martin Gaedke Georgi Kobilarov Discovering andmaintaining links on the Web of data in International Semantic WebConference (2009) 650ndash665

[73] Denny Vrandeciacutec Markus Kroumltzsch Sebastian Rudolph Uta Loumlsch Leveragingnon-lexical knowledge for the linked open data Web Review of April Foolrsquosday Transactions (RAFT) 5 (2010) 18ndash27

[74] Jesse Weaver James A Hendler Parallel materialization of the finite RDFSclosure for hundreds of millions of triples in International Semantic WebConference (ISWC2009) (2009) 682ndash697

[75] Lotfi A Zadeh George J Klir Bo Yuan Fuzzy Sets Fuzzy Logic Fuzzy SystemsWorld Scientific Press 1996

  • Scalable and distributed methods for entity matching consolidation and disambiguation over linked data corpora
    • 1 Introduction
    • 2 Preliminaries
      • 21 RDF
      • 22 RDFS OWL and owlsameAs
      • 23 Inference rules
      • 24 T-split rules
      • 25 Authoritative T-split rules
      • 26 OWL 2 RLRDF rules
      • 27 Distribution architecture
      • 28 Experimental setup
        • 3 Experimental corpus
        • 4 Base-line consolidation
          • 41 High-level approach
          • 42 Distributed approach
          • 43 Performance evaluation
          • 44 Results evaluation
            • 5 Extended reasoning consolidation
              • 51 High-level approach
              • 52 Distributed approach
              • 53 Performance evaluation
              • 54 Results evaluation
                • 6 Statistical concurrence analysis
                  • 61 High-level approach
                    • 611 Quantifying concurrence
                    • 612 Implementing entity-concurrence analysis
                      • 62 Distributed implementation
                      • 63 Performance evaluation
                      • 64 Results evaluation
                        • 7 Entity disambiguation and repair
                          • 71 High-level approach
                          • 72 Implementing disambiguation
                          • 73 Distributed implementation
                          • 74 Performance evaluation
                          • 75 Results evaluation
                            • 8 Critical discussion
                            • 9 Related work
                              • 91 Related fields
                              • 92 Instance matching
                              • 93 Domain-specific consolidation
                              • 94 Large-scaleWeb-data consolidation
                              • 95 Large-scaledistributed consolidation
                              • 96 Candidate reduction
                              • 97 Naminglinking resources on the Web
                              • 98 Analyses of owlsameAs
                                • 10 Conclusion
                                • Acknowledgements
                                • Appendix A Prefixes
                                • References
Page 31: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

106 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

Noessner et al. [49] present an approach for aligning two A-Boxes described using the same T-Box; in particular, they leverage similarity measures introduced by Stuckenschmidt [65] and define an optimisation problem to identify the alignment that generates the highest weighted similarity between the two A-Boxes under analysis; they use Integer Linear Programming to generate the optimal alignment, encoding linear constraints to enforce valid (i.e., consistency preserving), one-to-one, functional mappings. Although they give performance results, they do not directly address scalability. Their method for comparing entities is similar in practice to ours: they measure the "overlapping knowledge" between two entities, counting how many assertions are true about both. The goal is to match entities such that the resulting consolidation is consistent and the measure of overlap is maximal.

Like us, Castano et al. [12] approach instance matching from two distinct perspectives: (i) determine coreferent identifiers; (ii) detect similar individuals based on the data they share. Much of their work is similar in principle to ours; in particular, they use reasoning for identifying equivalences and use a statistical approach for identifying properties "with high identification power". They do not consider use of inconsistency detection for disambiguating entities and, perhaps more critically, only evaluate with respect to a dataset containing 15 thousand entities.

Cudré-Mauroux et al. [17] present the idMesh system, which leverages user-defined associations and probabilistic methods to derive entity-level relationships, including resolution of conflicts; they also delineate entities based on "temporal discrimination", whereby coreferent entities may predate or postdate one another, capturing a description thereof at a particular point in time. The idMesh system itself is designed over a peer-to-peer network with centralised coordination. However, evaluation is over synthetic data, where they only demonstrate a maximum scale involving 8,000 entities and 24,000 links over 400 machines; the evaluation of performance focuses on network traffic and message exchange, as opposed to time.

9.3. Domain-specific consolidation

Various authors have looked at applying consolidation over domain-specific RDF corpora: e.g., Sleeman and Finin look at using machine learning techniques to consolidate FOAF personal profile information [61]; Shi et al. similarly look at FOAF-specific alignment techniques [59], using inverse-functional properties and fuzzy string matching; Jentzsch et al. examine alignment of published drug data [38];50 Raimond et al. look at interlinking RDF from the music domain [54]; Monaghan and O'Sullivan apply consolidation to photo annotations expressed in RDF [46].

Salvadores et al. [57] present the LinksB2N system, which aims to perform scalable integration of RDF data, particularly focusing on evaluation over corpora from the marketing domain; however, their methods are not specific to this domain. They do not leverage the semantics of the data for performing consolidation, instead using similarity measures based on the idea that "the unique combination of RDF predicates associated with RDF resources is what defines their existence as unique" [57]. This is a similar intuition to that behind our concurrence analysis, but we again question the validity of such an assumption for consolidation, particularly given incomplete data and the Open World Assumption underlying RDF(S)/OWL: we view an RDF resource as a description of something signified, and would wish to avoid conflating unique signifiers even if they match precisely with respect to their description.

50 In fact, we believe that this work generates the incorrect results observable in Table 6; cf. http://groups.google.com/group/pedantic-web/browse_thread/thread/ad740f7052cc3a2d

9.4. Large-scale Web-data consolidation

Like us, various authors have investigated consolidation techniques tailored to the challenges (esp. scalability and noise) of resolving coreference in heterogeneous RDF Web data.

On a larger scale, and following a similar approach to us, Hu et al. [36,35] have recently investigated a consolidation system (called ObjectCoRef) for application over 500 million triples of Web data. They too define a baseline consolidation (which they call the kernel) built by applying reasoning on owl:sameAs, owl:InverseFunctionalProperty, owl:FunctionalProperty and owl:cardinality. No other OWL constructs or reasoning is used when building the kernel. Then, starting from the coreferents in the kernel, they use a learning technique to find the most discriminative properties. Further, they reuse the notion of average (inverse) cardinality [34] to find frequent property combination pairs. A small-scale experiment is conducted that shows good results. At large scale, although they manage to build the kernel for the 500 million triples dataset, they only apply and evaluate their statistical method on a very small subset, confessing that the approach would not be feasible for the complete set.
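
To make the average (inverse) cardinality statistic concrete, the following is a minimal Python sketch (our own illustration, not the implementation of Hu et al. or of [34]): for each property it computes the mean number of distinct values per subject and the mean number of distinct subjects per value. The property names and toy triples are hypothetical.

```python
from collections import defaultdict

def avg_cardinalities(triples):
    """For each property p, return (average cardinality, average inverse cardinality):
    - average cardinality: mean number of distinct values per subject of p;
    - average inverse cardinality: mean number of distinct subjects per value of p.
    Low values on both suggest a property that discriminates well between entities."""
    subj_vals = defaultdict(lambda: defaultdict(set))  # property -> subject -> {values}
    val_subjs = defaultdict(lambda: defaultdict(set))  # property -> value -> {subjects}
    for s, p, o in triples:
        subj_vals[p][s].add(o)
        val_subjs[p][o].add(s)
    stats = {}
    for p in subj_vals:
        card = sum(len(vals) for vals in subj_vals[p].values()) / len(subj_vals[p])
        inv_card = sum(len(subs) for subs in val_subjs[p].values()) / len(val_subjs[p])
        stats[p] = (card, inv_card)
    return stats

# Hypothetical toy data: foaf:mbox behaves like an inverse-functional property.
triples = [
    ("ex:a", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:b", "foaf:mbox", "mailto:bob@example.org"),
    ("ex:a", "foaf:knows", "ex:b"),
    ("ex:c", "foaf:knows", "ex:b"),
]
print(avg_cardinalities(triples))
```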

The Sindice system has historically used inverse-functional properties to find equivalent identifiers, also investigating some bespoke "schema-level" reasoning to identify a wider range of such properties [50]. However, Sindice no longer uses such OWL features for determining coreference of entities.51 The related Sig.ma search system [69] internally uses IR-style string-based measures (e.g., TF-IDF scores) to mashup results in their engine; compared to our system, they do not prioritise precision, where mashups are designed to have a high recall of related data and to be disambiguated manually by the user using the interface. This can be verified anecdotally by searching for various ambiguous terms in their public interface.52

51 From personal communication with the Sindice team.
52 http://sig.ma/search?q=paris

Bishop et al. [6] apply reasoning over 1.2 billion Linked Data triples using the BigOWLIM reasoner; however, this dataset is manually selected as a merge of a number of smaller, known datasets, as opposed to an arbitrary corpus. They discuss well-known optimisations similar to our canonicalisation as a necessary means of avoiding the quadratic nature of traditional replacement semantics for owl:sameAs.
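
The canonicalisation idea can be illustrated with a minimal union-find sketch in Python, assuming a simple in-memory setting and an arbitrary rule for picking the canonical identifier (this is illustrative only, not BigOWLIM's or our own implementation; all identifiers are made up).

```python
class UnionFind:
    """Disjoint-set structure for grouping coreferent identifiers."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            # Keep the lexicographically lowest identifier as canonical (arbitrary choice).
            if ry < rx:
                rx, ry = ry, rx
            self.parent[ry] = rx

def canonicalise(triples, same_as_pairs):
    """Rewrite every term to the canonical member of its owl:sameAs equivalence class."""
    uf = UnionFind()
    for a, b in same_as_pairs:
        uf.union(a, b)
    return [(uf.find(s), p, uf.find(o)) for s, p, o in triples]

# Hypothetical example.
pairs = [("ex:TimBL", "dbpedia:Tim_Berners-Lee"),
         ("dbpedia:Tim_Berners-Lee", "fb:tim.berners-lee")]
data = [("fb:tim.berners-lee", "foaf:name", '"Tim Berners-Lee"')]
print(canonicalise(data, pairs))
```

Rewriting every identifier to one canonical member keeps each equivalence class represented once, rather than materialising all pairwise owl:sameAs rewritings.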

9.5. Large-scale/distributed consolidation

With respect to distributed consolidation, Urbani et al. [70] propose the WebPIE system, which uses MapReduce to perform pD* reasoning [67] using a cluster of commodity hardware, similar to ourselves. The pD* ruleset contains rules for handling owl:sameAs replacement, inverse-functional properties and functional properties, but not for cardinalities (which, in any case, we demonstrated to be ineffective over our corpus). The authors also discuss a canonicalisation approach for handling equivalent identifiers. They demonstrate their methods over 100 billion triples of synthetic LUBM data over 64 machines; however, they do not present evaluation over Linked Data, do not perform any form of similarity or concurrence measures, do not consider inconsistency detection (not given by the pD* ruleset or by their corpora), and generally have a somewhat different focus: scalable, distributed, rule-based materialisation.

9.6. Candidate reduction

For performing large-scale consolidation, one approach is to group together sets of candidates within which there are likely to be coreferent identifiers. Such candidate reduction then bypasses the need for complete pair-wise comparison, and can enable the use of more expensive matching methods in closed regions. This is particularly relevant for text-based analyses, which we have not tackled in this paper. However, the principle of candidate reduction generalises to any form of matching that cannot be performed in a complete pair-wise manner.

Song and Heflin [62] investigate scalable methods to prune the candidate-space for the instance matching task, focusing on scenarios such as our own. First, discriminating properties (i.e., those for which a typical value is shared by few instances) are used to group initial candidate sets. Second, the authors propose building textual descriptions of instances from surrounding text, and using an IR-style inverted index to query and group lexically similar instances. Evaluation demonstrates feasibility for matching up to one million instances, although the authors claim that the approach can scale further.
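
The first step can be sketched as follows; this is our rough reading of the general candidate-selection idea, not Song and Heflin's implementation, and the threshold, property names and data are illustrative assumptions. A property is treated as discriminating when each of its values is shared by few instances on average, and instances sharing a value for such a property form one candidate set.

```python
from collections import defaultdict

def candidate_sets(triples, max_avg_sharing=2.0):
    """Group instances into candidate sets keyed by (property, value) pairs of
    discriminating properties, bypassing all-pairs comparison."""
    by_pv = defaultdict(set)      # (property, value) -> {subjects sharing that value}
    sharing = defaultdict(list)   # property -> list of per-value sharing counts
    for s, p, o in triples:
        by_pv[(p, o)].add(s)
    for (p, _), subjects in by_pv.items():
        sharing[p].append(len(subjects))
    # Discriminating: on average, a value of the property is shared by few instances.
    discriminating = {p for p, counts in sharing.items()
                      if sum(counts) / len(counts) <= max_avg_sharing}
    return [subjects for (p, _), subjects in by_pv.items()
            if p in discriminating and len(subjects) > 1]

# Illustrative data: two instances sharing a (discriminating) title become candidates,
# while the generic rdf:type value is ignored.
triples = [
    ("ex:p1", "dc:title", "A Theory of Diagnosis from First Principles"),
    ("ex:p2", "dc:title", "A Theory of Diagnosis from First Principles"),
    ("ex:p1", "rdf:type", "foaf:Document"),
    ("ex:p2", "rdf:type", "foaf:Document"),
    ("ex:p3", "rdf:type", "foaf:Document"),
]
print(candidate_sets(triples))
```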

Ioannou et al. [37] also propose a system, called RDFSim, to cluster similar candidates based on associated textual information. Virtual prose documents are constructed for resources from surrounding text, which are then encoded in a special index using Locality-Sensitive Hashing (LSH). The core idea of LSH is to hash similar virtual documents into regions of space where distance measures can be used to identify similar resources. A set of hash functions can be employed for these purposes, and "close" documents are then subject to other similarity measures; the RDFSim system currently uses the Jaccard coefficient to compute textual similarities for these purposes.
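
The following self-contained Python sketch illustrates the general MinHash/LSH idea under simple assumptions; it is not the RDFSim index, and the documents, parameters and token sets are made up. Token sets stand in for the virtual documents, banded MinHash signatures group likely-similar resources into shared buckets, and the exact Jaccard coefficient is then computed only for candidates that collide.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def minhash_signature(tokens, num_hashes=20):
    """One min-hash per seeded hash function over the token set."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def lsh_candidates(docs, num_hashes=20, bands=10):
    """Bucket documents whose signatures agree on at least one band of rows."""
    rows = num_hashes // bands
    sigs = {name: minhash_signature(toks, num_hashes) for name, toks in docs.items()}
    buckets = defaultdict(set)
    for name, sig in sigs.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Hypothetical "virtual documents" built from text surrounding three resources.
docs = {
    "ex:paris1": {"paris", "france", "capital", "city"},
    "ex:paris2": {"paris", "france", "city", "seine"},
    "ex:texas":  {"paris", "texas", "usa", "town"},
}
for x, y in sorted(lsh_candidates(docs)):
    print(x, y, round(jaccard(docs[x], docs[y]), 2))
```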

9.7. Naming/linking resources on the Web

A number of works have investigated the fundamentals of resource naming and linking on the Web, proposing schemes or methods to bootstrap agreement between remote publishers on common resource URIs, or to enable publishers to better link their contribution with other external datasets.

With respect to URI naming on the Web, Bouquet et al. [10] propose OKKAM: a centralised naming architecture for minting URI signifiers for the Web. The OKKAM system thus supports entity lookups for publishers, where a search providing a description of the entity in question returns a set of possible candidates. Publishers can then re-use the URI for the candidate which most closely matches their intention in their local data, with the intuition that other publishers have done the same; thus, all users of the system will have their naming schemes aligned. However, we see such a centralised "naming authority" as going against the ad-hoc, decentralised, scale-free nature of the Web. More concretely, for example, having one URI per resource goes against current Linked Data principles, which encourage use of dereferenceable URIs: one URI can only dereference to one document, so which document should that be, and who should decide?

Online systems RKBExplorer [23,22],53 <sameAs>54 and ObjectCoref [15]55 offer on-demand querying for owl:sameAs relations found for a given input URI, which they internally compute and store; the former focuses on publishing owl:sameAs relations for authors and papers in the area of scientific publishing, with the latter two systems offering more general owl:sameAs relationships between Linked Data identifiers. In fact, many of the owl:sameAs relations we consume are published as Linked Data by the RKBExplorer system.

53 http://www.rkbexplorer.com/sameAs/
54 http://sameas.org/
55 http://ws.nju.edu.cn/objectcoref/

Volz et al. [72] present the Silk framework for creating and maintaining inter-linkage between domain-specific RDF datasets; in particular, this framework provides publishers with a means of discovering and creating owl:sameAs links between data sources using domain-specific rules and parameters. Thereafter, publishers can integrate discovered links into their exports, enabling better linkage of the data and subsequent consolidation by data consumers: this framework goes hand-in-hand with our approach, producing the owl:sameAs relations which we consume.

Popitsch and Haslhofer present discussion on the problem of broken links in Linked Data, identifying structurally broken links (the Web of Data's version of a "deadlink") and semantically broken links, where the original meaning of an identifier changes after a link has been remotely asserted [52]. The authors subsequently present the DSNotify system, which monitors dynamicity over a given subset of Linked Data and can detect and act upon changes (e.g., to notify another agent or correct broken links), and can also be used to indirectly link the dynamic target.

9.8. Analyses of owl:sameAs

Recent papers have analysed and discussed the use of owl:sameAs on the Web. These papers have led to debate on whether owl:sameAs is appropriate for modelling coreference across the Web.

Halpin et al. [26] look at the semantics and current usage of owl:sameAs in Linked Data, discussing issues relating to identity, and providing four categories of owl:sameAs usage to relate entities that are closely related, but for which the semantics of owl:sameAs (particularly substitution) does not quite hold. The authors also take and manually inspect 500 owl:sameAs links sampled (using logarithmic weights for each domain) from Web data. Their experiments suggest that, although the majority of owl:sameAs relations are considered correct, many were still incorrect, and disagreement between the judges indicates that the quality of specific owl:sameAs links can be subjective [26]. In fact, their results are much more pessimistic than ours with regard to the quality of owl:sameAs on the Web: whereas we established a precision figure of 97.2% for explicit owl:sameAs through manual inspection, Halpin et al. put the figure at 51% (±21%). Even taking the most optimistic figure of 72% from Halpin's paper, there is a wide gap to our result. Possible reasons for the discrepancy include variations in sampling techniques, as well as different criteria for manual judging.56

56 Again, our inspection results are available at http://swse.deri.org/entity

Ding et al. [18] also provide quantitative analysis of owl:sameAs usage in the BTC-2010 dataset; some of these results correspond with analogous measures we have presented in Section 4.4, but for a different (albeit similar) sampled corpus. They found that URIs with at least one coreferent identifier had an average of 2.4 identifiers (this corresponds closely with our measurement of 2.65 identifiers per equivalence class for baseline consolidation). The average path length of owl:sameAs was 1.07, indicating that few transitive links are given. The largest equivalence class found was 5 thousand (versus 8,481 in our case). They also discuss the landscape of owl:sameAs linkage between different publishers of Linked Data.

10. Conclusion

In this paper, we have provided comprehensive discussion on scalable and distributed methods for consolidating, matching and disambiguating entities present in a large, static Linked Data corpus. Throughout, we have focused on the scalability and practicalities of applying our methods over real, arbitrary Linked Data in a domain-agnostic and (almost entirely) automatic fashion. We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach, leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of those entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach using inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, where the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expands, scalable and precise data integration techniques will become of vital importance, particularly for data warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1

Prefix       URI

"T-Box prefixes"
atomowl      http://bblfish.net/work/atom-owl/2006-06-06/
b2r          http://bio2rdf.org/bio2rdf
b2rr         http://bio2rdf.org/bio2rdf_resource
dbo          http://dbpedia.org/ontology/
dbp          http://dbpedia.org/property/
ecs          http://rdf.ecs.soton.ac.uk/ontology/ecs
eurostat     http://ontologycentral.com/2009/01/eurostat/ns#
fb           http://rdf.freebase.com/ns/
foaf         http://xmlns.com/foaf/0.1/
geonames     http://www.geonames.org/ontology#
kwa          http://knowledgeweb.semanticweb.org/heterogeneity/alignment#
lldpubmed    http://linkedlifedata.com/resource/pubmed/
loc          http://sw.deri.org/2006/07/location/loc#
mo           http://purl.org/ontology/mo/
mvcb         http://webns.net/mvcb/
opiumfield   http://rdf.opiumfield.com/lastfm/spec#
owl          http://www.w3.org/2002/07/owl#
quaffing     http://purl.org/net/schemas/quaffing/
rdf          http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs         http://www.w3.org/2000/01/rdf-schema#
skipinions   http://skipforward.net/skipforward/page/seeder/skipinions/
skos         http://www.w3.org/2004/02/skos/core#

"A-Box prefixes"
avtimbl      http://www.advogato.org/person/timbl/foaf.rdf#
dbpedia      http://dbpedia.org/resource/
dblpperson   http://www4.wiwiss.fu-berlin.de/dblp/resource/person/
eswc2006p    http://www.eswc2006.org/people/
identicau    http://identi.ca/user/
kingdoms     http://lod.geospecies.org/kingdoms/
macs         http://stitch.cs.vu.nl/alignments/macs/
semweborg    http://data.semanticweb.org/organization/
swid         http://semanticweb.org/id/
timblfoaf    http://www.w3.org/People/Berners-Lee/card#
vperson      http://virtuoso.openlinksw.com/dataspace/person/
wikier       http://www.wikier.org/foaf.rdf#
yagor        http://www.mpii.de/yago/resource/

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.
[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.
[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/
[4] Tim Berners-Lee, Linked Data, Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.
[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.
[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.
[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.
[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).
[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.
[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/
[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.
[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.
[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).
[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.
[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.
[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.
[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.
[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.
[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).
[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.
[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).
[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.
[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/
[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.
[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.
[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).
[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/
[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.
[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3: Identity, Identifiers, Identification Workshop, 2007.
[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.
[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web search engine, J. Web Sem. 9 (4) (2011) 365–401.
[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.
[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.
[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.
[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.
[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.
[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.
[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.
[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).
[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.
[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.
[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.
[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage and Object Consolidation (2003).
[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.
[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.
[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.
[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.
[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.
[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.
[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.
[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.
[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/
[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.
[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.
[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.
[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.
[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).
[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).
[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.
[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).
[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).
[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.
[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.
[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.
[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.
[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.
[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.
[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).
[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.
[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.


[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.
[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.
[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.
[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.


[53] Eric Prudrsquohommeaux Andy Seaborne SPARQL Query Language for RDF W3CRecommendation (2008) httpwwww3orgTRrdf-sparql-query

[54] Yves Raimond Christopher Sutton Mark B Sandler Interlinking Music-Related Data on the Web IEEE MultiMedia 16 (2) (2009) 52ndash63

[55] Raymond Reiter A Theory of Diagnosis from First Principles Artif Intell 32 (1)(1987) 57ndash95

[56] Fatiha Saıs Nathalie Pernelle Marie-Christine Rousset Combining a logicaland a numerical method for data reconciliation J Data Sem 12 (2009)66ndash94

[57] Manuel Salvadores Gianluca Correndo Bene Rodriguez-Castro NicholasGibbins John Darlington Nigel R Shadbolt LinksB2N automatic dataintegration for the Semantic Web in OTM Conferences (2) 2009 pp 1121ndash1138

[58] F Scharffe Y Liu C Zhou RDF-AI an architecture for RDF datasets matchingfusion and interlink in IJCAI 2009 Workshop on Identity Reference andKnowledge Representation (IR-KR) (2009)

[59] Lian Shi Diego Berrueta Sergio Fernaacutendez Luis Polo Silvino FernaacutendezSmushing RDF instances are Alice and Bob the same open source developerin PICKME Workshop (2008)

[60] Sameer Singh Michael L Wick Andrew McCallum Distantly labeling data forlarge scale cross-document coreference CoRR (2010) abs10054298

[61] Jennifer Sleeman Tim Finin Learning co-reference relations for FOAFinstances in Poster and Demo Session at ISWC (2010)

[62] Dezhao Song Jeff Heflin Automatically generating data linkages using adomain-independent candidate selection approach in International SemanticWeb Conference (2011)

[63] Mechthild Stoer Frank Wagner A simple min-cut algorithm J ACM 44 (4)(1997) 585ndash591

[64] Michael Stonebraker The Case for Shared Nothing IEEE Database Eng Bull 9(1) (1986) 4ndash9

[65] Heiner Stuckenschmidt A semantic similarity measure for ontology-basedinformation in FQAS rsquo09 Proceedings of the Eighth International Conferenceon Flexible Query Answering Systems Springer-Verlag Berlin Heidelberg2009 pp 406ndash417

[66] Robert Endre Tarjan A class of algorithms which require nonlinear time tomaintain disjoint sets J Comput Syst Sci 18 (2) (1979) 110ndash127

[67] Herman J ter Horst Completeness decidability and complexity of entailmentfor RDF Schema and a semantic extension involving the OWL vocabulary JWeb Sem 3 (2005) 79ndash115

[68] Andreas Thor Erhard Rahm Moma ndash a mapping-based object matchingsystem in CIDR (2007) 247ndash258

[69] Giovanni Tummarello Richard Cyganiak Michele Catasta Szymon DanielczykStefan Decker Sigma live views on the Web of data in Semantic WebChallenge (ISWC2009) (2009)

[70] Jacopo Urbani Spyros Kotoulas Jason Maassen Frank van Harmelen Henri EBal OWL reasoning with WebPIE calculating the closure of 100 billion triplesin ESWC (1) 2010 pp 213ndash227

[71] Jacopo Urbani Spyros Kotoulas Eyal Oren Frank van Harmelen Scalabledistributed reasoning using mapreduce in International Semantic WebConference (2009) 634ndash649

110 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

[72] Julius Volz Christian Bizer Martin Gaedke Georgi Kobilarov Discovering andmaintaining links on the Web of data in International Semantic WebConference (2009) 650ndash665

[73] Denny Vrandeciacutec Markus Kroumltzsch Sebastian Rudolph Uta Loumlsch Leveragingnon-lexical knowledge for the linked open data Web Review of April Foolrsquosday Transactions (RAFT) 5 (2010) 18ndash27

[74] Jesse Weaver James A Hendler Parallel materialization of the finite RDFSclosure for hundreds of millions of triples in International Semantic WebConference (ISWC2009) (2009) 682ndash697

[75] Lotfi A Zadeh George J Klir Bo Yuan Fuzzy Sets Fuzzy Logic Fuzzy SystemsWorld Scientific Press 1996

  • Scalable and distributed methods for entity matching consolidation and disambiguation over linked data corpora
    • 1 Introduction
    • 2 Preliminaries
      • 21 RDF
      • 22 RDFS OWL and owlsameAs
      • 23 Inference rules
      • 24 T-split rules
      • 25 Authoritative T-split rules
      • 26 OWL 2 RLRDF rules
      • 27 Distribution architecture
      • 28 Experimental setup
        • 3 Experimental corpus
        • 4 Base-line consolidation
          • 41 High-level approach
          • 42 Distributed approach
          • 43 Performance evaluation
          • 44 Results evaluation
            • 5 Extended reasoning consolidation
              • 51 High-level approach
              • 52 Distributed approach
              • 53 Performance evaluation
              • 54 Results evaluation
                • 6 Statistical concurrence analysis
                  • 61 High-level approach
                    • 611 Quantifying concurrence
                    • 612 Implementing entity-concurrence analysis
                      • 62 Distributed implementation
                      • 63 Performance evaluation
                      • 64 Results evaluation
                        • 7 Entity disambiguation and repair
                          • 71 High-level approach
                          • 72 Implementing disambiguation
                          • 73 Distributed implementation
                          • 74 Performance evaluation
                          • 75 Results evaluation
                            • 8 Critical discussion
                            • 9 Related work
                              • 91 Related fields
                              • 92 Instance matching
                              • 93 Domain-specific consolidation
                              • 94 Large-scaleWeb-data consolidation
                              • 95 Large-scaledistributed consolidation
                              • 96 Candidate reduction
                              • 97 Naminglinking resources on the Web
                              • 98 Analyses of owlsameAs
                                • 10 Conclusion
                                • Acknowledgements
                                • Appendix A Prefixes
                                • References
Page 33: Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora

108 A Hogan et al Web Semantics Science Services and Agents on the World Wide Web 10 (2012) 76ndash110

We have shown how to use explicit owl:sameAs relations in the data to perform consolidation, and subsequently expanded this approach by leveraging the declarative formal semantics of the corpus to materialise additional owl:sameAs relations. We also presented a scalable approach to identify weighted concurrence relations for entities that share many inlinks, outlinks and attribute values; we note that many of the entities demonstrating the highest concurrence were not coreferent. Next, we presented an approach that uses inconsistencies to disambiguate entities and subsequently repair equivalence classes; we found that this approach currently derives few diagnoses, since the granularity of inconsistencies within Linked Data is not sufficient for accurately pinpointing all incorrect consolidation. Finally, we tempered our contribution with critical discussion, particularly focusing on scalability and efficiency concerns.
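To illustrate the core of the baseline consolidation step, the following minimal single-machine sketch (in Python, over hypothetical ex: identifiers) groups coreferent names from explicit owl:sameAs statements with a union-find structure and rewrites each name to a canonical representative; it is only an illustration of the idea, not the distributed sort- and scan-based implementation evaluated in the paper.

OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"

def consolidate(triples):
    # Union-find over identifiers connected by owl:sameAs.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            if rb < ra:  # keep the lexicographically lowest name as canonical
                ra, rb = rb, ra
            parent[rb] = ra

    for s, p, o in triples:
        if p == OWL_SAMEAS:
            union(s, o)

    # Rewrite the remaining triples to canonical identifiers; drop the sameAs statements.
    return [(find(s), p, find(o)) for s, p, o in triples if p != OWL_SAMEAS]

# Hypothetical toy input: ex:a, ex:b and ex:c all collapse to the single name ex:a.
triples = [
    ("ex:a", OWL_SAMEAS, "ex:b"),
    ("ex:b", OWL_SAMEAS, "ex:c"),
    ("ex:c", "ex:name", '"Alice"'),
]
print(consolidate(triples))  # [('ex:a', 'ex:name', '"Alice"')]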

We believe that the area of research touched upon in this paper, particularly as applied to large-scale Linked Data corpora, is of particular significance given the rapid growth in popularity of Linked Data publishing. As the scale and diversity of the Web of Data expand, scalable and precise data-integration techniques will become of vital importance, particularly for data-warehousing applications; we see the work presented herein as a significant step in the right direction.

Acknowledgements

We thank Andreas Harth for involvement in early works. We also thank the anonymous reviewers and the editors for their time and valuable comments. The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2) and by an IRCSET postgraduate scholarship.

Appendix A. Prefixes

Table A1 lists the prefixes used throughout the paper.

Table A1
Prefixes used throughout the paper (prefix names shown; namespace URIs omitted).

"T-Box" prefixes: atomowl, b2r, b2rr, dbo, dbp, ecs, eurostat, fb, foaf, geonames, kwa, lldpubmed, loc, mo, mvcb, opiumfield, owl, quaffing, rdf, rdfs, skipinions, skos.

"A-Box" prefixes: avtimbl, dbpedia, dblpperson, eswc2006p, identicau, kingdoms, macs, semweborg, swid, timblfoaf, vperson, wikier, yagor.

References

[1] Riccardo Albertoni, Monica De Martino, Semantic similarity of ontology instances tailored on the application context, in: Proc. of OTM 2006, Part I, Volume 4275 of LNCS, Springer, 2006, pp. 1020–1038.

[2] Riccardo Albertoni, Monica De Martino, Asymmetric and context-dependent semantic similarity among ontology instances, J. Data Sem. 4900 (10) (2008) 1–30.

[3] David Beckett, Tim Berners-Lee, Turtle – Terse RDF Triple Language, W3C Team Submission (2008). http://www.w3.org/TeamSubmission/turtle/.

[4] Tim Berners-Lee, Linked Data – Design issues for the World Wide Web, World Wide Web Consortium, 2006. Available from <http://www.w3.org/DesignIssues/LinkedData.html>.

[5] Philip A. Bernstein, Sergey Melnik, Peter Mork, Interactive schema translation with instance-level mappings, in: Proc. of VLDB 2005, ACM Press, 2005, pp. 1283–1286.

[6] Barry Bishop, Atanas Kiryakov, Damyan Ognyanov, Ivan Peikov, Zdravko Tashev, Ruslan Velkov, FactForge: a fast track to the web of data, Semantic Web 2 (2) (2011) 157–166.

[7] Christian Bizer, Richard Cyganiak, Tom Heath, How to Publish Linked Data on the Web, linkeddata.org Tutorial (2008). Available from <http://linkeddata.org/docs/how-to-publish>.

[8] Jens Bleiholder, Felix Naumann, Data fusion, ACM Comput. Surv. 41 (1) (2008).

[9] Piero A. Bonatti, Aidan Hogan, Axel Polleres, Luigi Sauro, Robust and scalable linked data reasoning incorporating provenance and trust annotations, J. Web Sem. 9 (2) (2011) 165–201.

[10] Paolo Bouquet, Heiko Stoermer, Michele Mancioppi, Daniel Giacomuzzi, OkkaM: towards a solution to the "identity crisis" on the Semantic Web, in: Proceedings of SWAP 2006, the Third Italian Semantic Web Workshop, Volume 201 of CEUR Workshop Proceedings (2006).

[11] Dan Brickley, R.V. Guha, RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation (2004). http://www.w3.org/TR/rdf-schema/.

[12] Silvana Castano, Alfio Ferrara, Stefano Montanelli, Davide Lorusso, Instance Matching for Ontology Population, in: Salvatore Gaglio, Ignazio Infantino, Domenico Saccà (Eds.), SEBD, 2008, pp. 121–132.

[13] Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting relationships for object consolidation, in: IQIS '05: Proceedings of the Second International Workshop on Information Quality in Information Systems, ACM Press, New York, NY, USA, 2005, pp. 47–58.

[14] Gong Cheng, Weiyi Ge, Honghan Wu, Yuzhong Qu, Searching Semantic Web objects based on class hierarchies, in: Proceedings of Linked Data on the Web Workshop (2008).

[15] Gong Cheng, Yuzhong Qu, Searching linked objects with Falcons: approach, implementation and evaluation, Int. J. Sem. Web Inf. Syst. 5 (3) (2009) 49–70.

[16] Martine De Cock, Etienne E. Kerre, On (un)suitable fuzzy relations to model approximate equality, Fuzzy Sets Syst. 133 (2) (2003) 137–153.

[17] Philippe Cudré-Mauroux, Parisa Haghani, Michael Jost, Karl Aberer, Hermann de Meer, idMesh: graph-based disambiguation of linked data, WWW (2009) 591–600.

[18] Li Ding, Joshua Shinavier, Zhenning Shangguan, Deborah L. McGuinness, SameAs networks and beyond: analyzing deployment status and implications of owl:sameAs in linked data, in: Proceedings of the Ninth International Semantic Web Conference on the Semantic Web – Volume Part I, ISWC'10, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 145–160.

[19] A.K. Elmagarmid, P.G. Ipeirotis, V.S. Verykios, Duplicate record detection: a survey, IEEE Trans. Knowledge Data Eng. 19 (1) (2007) 1–16.

[20] Jérôme Euzenat, Alfio Ferrara, Christian Meilicke, Andriy Nikolov, Juan Pane, François Scharffe, Pavel Shvaiko, Heiner Stuckenschmidt, Ondrej Sváb-Zamazal, Vojtech Svátek, Cássia Trojahn, Results of the ontology alignment evaluation initiative 2010, OM (2010).

[21] Wenfei Fan, Xibei Jia, Jianzhong Li, Shuai Ma, Reasoning about record matching rules, PVLDB 2 (1) (2009) 407–418.

[22] Hugh Glaser, Afraz Jaffri, Ian Millard, Managing co-reference on the Semantic Web, in: Proc. of LDOW 2009 (2009).

[23] Hugh Glaser, Ian Millard, Afraz Jaffri, RKBExplorer.com: A knowledge driven infrastructure for linked data providers, in: ESWC Demo, Lecture Notes in Computer Science, Springer, 2008, pp. 797–801.

[24] Bernardo Cuenca Grau, Boris Motik, Zhe Wu, Achille Fokoue, Carsten Lutz (Eds.), OWL 2 Web Ontology Language: Profiles, W3C Working Draft (2008). http://www.w3.org/TR/owl2-profiles/.

[25] Laura M. Haas, Martin Hentschel, Donald Kossmann, Renée J. Miller, Schema and data: a holistic approach to mapping, resolution and fusion in information integration, ER (2009) 27–40.

[26] Harry Halpin, Patrick J. Hayes, James P. McCusker, Deborah L. McGuinness, Henry S. Thompson, When owl:sameAs Isn't the Same: An Analysis of Identity in Linked Data, in: International Semantic Web Conference (1) (2010) 305–320.

[27] Harry Halpin, Ivan Herman, Pat Hayes, When owl:sameAs isn't the Same: An Analysis of Identity Links on the Semantic Web, in: Linked Data on the Web, WWW2010 Workshop (LDOW2010) (2010).

[28] Patrick Hayes, RDF Semantics, W3C Recommendation (2004). http://www.w3.org/TR/rdf-mt/.

[29] Aidan Hogan, Exploiting RDFS and OWL for Integrating Heterogeneous, Large-Scale, Linked Data Corpora, PhD Thesis, Digital Enterprise Research Institute, National University of Ireland, Galway, 2011. Available from <http://aidanhogan.com/docs/thesis/>.

[30] Aidan Hogan, Andreas Harth, Stefan Decker, Performing Object Consolidation on the Semantic Web Data Graph, in: First I3 Workshop: Identity, Identifiers, Identification Workshop, 2007.

[31] Aidan Hogan, Andreas Harth, Axel Polleres, Scalable Authoritative OWL Reasoning for the Web, Int. J. Semantic Web Inf. Syst. 5 (2) (2009) 49–90.

[32] Aidan Hogan, Andreas Harth, Jürgen Umbrich, Sheila Kinsella, Axel Polleres, Stefan Decker, Searching and browsing linked data with SWSE: the Semantic Web Search Engine, J. Web Sem. 9 (4) (2011) 365–401.

[33] Aidan Hogan, Jeff Z. Pan, Axel Polleres, Stefan Decker, SAOR: template rule optimisations for distributed reasoning over 1 billion linked data triples, in: International Semantic Web Conference, 2010.

[34] Aidan Hogan, Axel Polleres, Jürgen Umbrich, Antoine Zimmermann, Some entities are more equal than others: statistical methods to consolidate Linked Data, in: Fourth International Workshop on New Forms of Reasoning for the Semantic Web: Scalable and Dynamic (NeFoRS2010), 2010.

[35] Wei Hu, Jianfeng Chen, Yuzhong Qu, A self-training approach for resolving object coreference on the Semantic Web, WWW (2011) 87–96.

[36] Wei Hu, Yuzhong Qu, Xingzhi Sun, Bootstrapping object coreferencing on the Semantic Web, J. Comput. Sci. Technol. 26 (4) (2011) 663–675.

[37] Ekaterini Ioannou, Odysseas Papapetrou, Dimitrios Skoutas, Wolfgang Nejdl, Efficient semantic-aware detection of near duplicate resources, in: ESWC (2), 2010, pp. 136–150.

[38] Anja Jentzsch, Jun Zhao, Oktie Hassanzadeh, Kei-Hoi Cheung, Matthias Samwald, Bo Andersson, Linking Open Drug Data, in: International Conference on Semantic Systems (I-SEMANTICS'09), 2009.

[39] Frank Klawonn, Should fuzzy equality and similarity satisfy transitivity? Comments on the paper by M. De Cock and E. Kerre, Fuzzy Sets Syst. 133 (2) (2003) 175–180.

[40] Vladimir Kolovski, Zhe Wu, George Eadon, Optimizing enterprise-scale OWL 2 RL reasoning in a relational database system, in: International Semantic Web Conference (2010).

[41] Juanzi Li, Jie Tang, Yi Li, Qiong Luo, RiMOM: A dynamic multistrategy ontology alignment framework, IEEE Trans. Knowl. Data Eng. 21 (8) (2009) 1218–1232.

[42] Deborah L. McGuinness, Frank van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation (2004). Available from <http://www.w3.org/TR/owl-features/>.

[43] Georgios Meditskos, Nick Bassiliades, DLEJena: A practical forward-chaining OWL 2 RL reasoner combining Jena and Pellet, J. Web Sem. 8 (1) (2010) 89–94.

[44] Martin Michalowski, Snehal Thakkar, Craig A. Knoblock, Exploiting secondary sources for automatic object consolidation, in: Proceedings of the 2003 KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003).

[45] Alistair Miles, Thomas Baker, Ralph Swick, Best Practice Recipes for Publishing RDF Vocabularies, March 2006. Available from <http://www.w3.org/TR/2006/WD-swbp-vocab-pub-20060314/>. Superseded by Berrueta and Phipps: <http://www.w3.org/TR/swbp-vocab-pub/>.

[46] Fergal Monaghan, David O'Sullivan, Leveraging ontologies, context and social networks to automate photo annotation, in: SAMT (2007) 252–255.

[47] H.B. Newcombe, J.M. Kennedy, S.J. Axford, A.P. James, Automatic Linkage of Vital Records: Computers can be used to extract "follow-up" statistics of families from files of routine records, Science 130 (1959) 954–959.

[48] Andriy Nikolov, Victoria S. Uren, Enrico Motta, Anne N. De Roeck, Integration of Semantically Annotated Data by the KnoFuss Architecture, in: EKAW, 2008, pp. 265–274.

[49] Jan Noessner, Mathias Niepert, Christian Meilicke, Heiner Stuckenschmidt, Leveraging terminological structure for object reconciliation, in: ESWC (2), 2010, pp. 334–348.

[50] Eyal Oren, Renaud Delbru, Michele Catasta, Richard Cyganiak, Holger Stenzhorn, Giovanni Tummarello, Sindice.com: a document-oriented lookup index for open linked data, Int. J. Metadata Semant. Ontol. 3 (1) (2008) 37–52.

[51] Eyal Oren, Spyros Kotoulas, George Anadiotis, Ronny Siebes, Annette ten Teije, Frank van Harmelen, Marvin: distributed reasoning over large-scale Semantic Web data, J. Web Sem. 7 (4) (2009) 305–316.

[52] Niko Popitsch, Bernhard Haslhofer, DSNotify: handling broken links in the web of data, in: WWW (2010) 761–770.

[53] Eric Prud'hommeaux, Andy Seaborne, SPARQL Query Language for RDF, W3C Recommendation (2008). http://www.w3.org/TR/rdf-sparql-query/.

[54] Yves Raimond, Christopher Sutton, Mark B. Sandler, Interlinking Music-Related Data on the Web, IEEE MultiMedia 16 (2) (2009) 52–63.

[55] Raymond Reiter, A Theory of Diagnosis from First Principles, Artif. Intell. 32 (1) (1987) 57–95.

[56] Fatiha Saïs, Nathalie Pernelle, Marie-Christine Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Sem. 12 (2009) 66–94.

[57] Manuel Salvadores, Gianluca Correndo, Bene Rodriguez-Castro, Nicholas Gibbins, John Darlington, Nigel R. Shadbolt, LinksB2N: automatic data integration for the Semantic Web, in: OTM Conferences (2), 2009, pp. 1121–1138.

[58] F. Scharffe, Y. Liu, C. Zhou, RDF-AI: an architecture for RDF datasets matching, fusion and interlink, in: IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR) (2009).

[59] Lian Shi, Diego Berrueta, Sergio Fernández, Luis Polo, Silvino Fernández, Smushing RDF instances: are Alice and Bob the same open source developer?, in: PICKME Workshop (2008).

[60] Sameer Singh, Michael L. Wick, Andrew McCallum, Distantly labeling data for large scale cross-document coreference, CoRR (2010) abs/1005.4298.

[61] Jennifer Sleeman, Tim Finin, Learning co-reference relations for FOAF instances, in: Poster and Demo Session at ISWC (2010).

[62] Dezhao Song, Jeff Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: International Semantic Web Conference (2011).

[63] Mechthild Stoer, Frank Wagner, A simple min-cut algorithm, J. ACM 44 (4) (1997) 585–591.

[64] Michael Stonebraker, The Case for Shared Nothing, IEEE Database Eng. Bull. 9 (1) (1986) 4–9.

[65] Heiner Stuckenschmidt, A semantic similarity measure for ontology-based information, in: FQAS '09: Proceedings of the Eighth International Conference on Flexible Query Answering Systems, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 406–417.

[66] Robert Endre Tarjan, A class of algorithms which require nonlinear time to maintain disjoint sets, J. Comput. Syst. Sci. 18 (2) (1979) 110–127.

[67] Herman J. ter Horst, Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary, J. Web Sem. 3 (2005) 79–115.

[68] Andreas Thor, Erhard Rahm, MOMA – a mapping-based object matching system, in: CIDR (2007) 247–258.

[69] Giovanni Tummarello, Richard Cyganiak, Michele Catasta, Szymon Danielczyk, Stefan Decker, Sig.ma: live views on the Web of data, in: Semantic Web Challenge (ISWC2009) (2009).

[70] Jacopo Urbani, Spyros Kotoulas, Jason Maassen, Frank van Harmelen, Henri E. Bal, OWL reasoning with WebPIE: calculating the closure of 100 billion triples, in: ESWC (1), 2010, pp. 213–227.

[71] Jacopo Urbani, Spyros Kotoulas, Eyal Oren, Frank van Harmelen, Scalable distributed reasoning using MapReduce, in: International Semantic Web Conference (2009) 634–649.

[72] Julius Volz, Christian Bizer, Martin Gaedke, Georgi Kobilarov, Discovering and maintaining links on the Web of data, in: International Semantic Web Conference (2009) 650–665.

[73] Denny Vrandečić, Markus Krötzsch, Sebastian Rudolph, Uta Lösch, Leveraging non-lexical knowledge for the linked open data Web, Review of April Fool's day Transactions (RAFT) 5 (2010) 18–27.

[74] Jesse Weaver, James A. Hendler, Parallel materialization of the finite RDFS closure for hundreds of millions of triples, in: International Semantic Web Conference (ISWC2009) (2009) 682–697.

[75] Lotfi A. Zadeh, George J. Klir, Bo Yuan, Fuzzy Sets, Fuzzy Logic, Fuzzy Systems, World Scientific Press, 1996.
