Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie



Mini-Viva

Aidan Hogan

12th February, 2010



Overview

Fig 1: RDF Web Dataset

explicit data

implicit data

Topic of today’s talk: How to exploit implicit data for “query answering”



Query Answering…

…over RDF Web data… (Linked Data if you prefer)

Search engines such as SWSE, Sindice, Falcons, Swoogle, Watson etc.

SPARQL endpoints over Web data such as YARS2, Virtuoso, etc.



ex:Aidan ex:presented ex:RR2009Talk .

deri:Aidan ex:presented deri:FridayTalk .

Problem: Synonymous Omissions

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset split into IMPLICIT and EXPLICIT data]


ex:Aidan ex:presented deri:FridayTalk .

ex:Aidan ex:presented ex:FridayTalk .

Problem: Synonymous Duplicates



[Figure: dataset split into IMPLICIT and EXPLICIT data]


ex:RR2009Talk ex:presentedBy ex:Aidan .

Problem: Incomplete Answers



[Figure: dataset split into IMPLICIT and EXPLICIT data]


Solution: Publish Complete Data?



[Figure: dataset split into IMPLICIT and EXPLICIT data]



Solution: Write Query in many ways?


ex:Aidan ex:presented ?talk .

?talk ex:presentedBy ex:Aidan .

[Figure: dataset split into IMPLICIT and EXPLICIT data]


deri:Aidan ex:presented deri:FridayTalk .

deri:Aidan owl:sameAs ex:Aidan .

ex:RR2009Talk ex:presentedBy ex:Aidan .

ex:presentedBy owl:inverseOf ex:presented .



Solution: Exploit OWL and RDFS…



[Figure: dataset split into IMPLICIT and EXPLICIT data]


OWL / RDFS

(loosely) Define the semantics of classes and properties… (define relationships between terms)

ex:presentedBy owl:inverseOf ex:presented .

ex:presentedBy rdfs:domain ex:Talk .

ex:presentedBy rdfs:range ex:Person .

Define equivalence between individuals (owl:sameAs)

ex:Aidan owl:sameAs deri:Aidan .

Give machines an insight into the meaning of data

Allows for reasoning



Reasoning

(Loosely) Use the semantics of classes and properties—defined in RDFS and OWL—to make implicit knowledge explicit

One approach is using rules: IF condition THEN consequent

?p1 owl:inverseOf ?p2 . ?s ?p1 ?o . => ?o ?p2 ?s .

ex:presentedBy owl:inverseOf ex:presented .
ex:FridayTalk ex:presentedBy ex:Aidan .
=> ex:Aidan ex:presented ex:FridayTalk .

?p rdfs:domain ?c . ?s ?p ?o . => ?s rdf:type ?c .

?p rdfs:range ?c . ?s ?p ?o . => ?o rdf:type ?c .

ex:presentedBy rdfs:domain ex:Talk .
ex:FridayTalk ex:presentedBy ex:Aidan .
=> ex:FridayTalk rdf:type ex:Talk .

ex:presentedBy rdfs:range ex:Person .
ex:FridayTalk ex:presentedBy ex:Aidan .
=> ex:Aidan rdf:type ex:Person .
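The three rules on this slide can be tried out with a few lines of Python. This is only a minimal in-memory sketch: triples are plain tuples of prefixed names, the function names `apply_rules` and `materialise` are invented for illustration, and the naive nested loop is nothing like a scalable reasoner.

```python
# Minimal forward-chaining sketch for the inverseOf, domain and range rules.
# Triples are (subject, predicate, object) tuples; all names are illustrative.
INVERSE_OF = "owl:inverseOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def apply_rules(triples):
    """Apply each rule once over `triples`; return only the new triples."""
    inferred = set()
    for (term, relation, value) in triples:      # candidate schema triple
        for (s, p, o) in triples:                # candidate instance triple
            if p != term:
                continue
            if relation == INVERSE_OF:           # ?o ?p2 ?s
                inferred.add((o, value, s))
            elif relation == DOMAIN:             # ?s rdf:type ?c
                inferred.add((s, TYPE, value))
            elif relation == RANGE:              # ?o rdf:type ?c
                inferred.add((o, TYPE, value))
    return inferred - triples


def materialise(triples):
    """Repeat rule application until no new triples appear (a fixpoint)."""
    closure = set(triples)
    while True:
        new = apply_rules(closure)
        if not new:
            return closure
        closure |= new


data = {
    ("ex:presentedBy", INVERSE_OF, "ex:presented"),
    ("ex:presentedBy", DOMAIN, "ex:Talk"),
    ("ex:presentedBy", RANGE, "ex:Person"),
    ("ex:FridayTalk", "ex:presentedBy", "ex:Aidan"),
}
closure = materialise(data)
for triple in sorted(closure - data):
    print(triple)
```

The printed triples are the implicit statements from the slide made explicit; a query for `ex:Aidan ex:presented ?talk` can then be answered by simple lookup.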



deri:Aidan ex:presented deri:FridayTalk .

deri:Aidan owl:sameAs ex:Aidan .

ex:RR2009Talk ex:presentedBy ex:Aidan .

ex:presentedBy owl:inverseOf ex:presented .

Reasoning: Make Explicit the Implicit



ex:Aidan ex:presented deri:FridayTalk .

ex:Aidan ex:presented ex:RR2009Talk .

[Figure: dataset split into IMPLICIT and EXPLICIT data]


Web Reasoning: Challenges

Scalability
– Billions or tens of billions of statements (for the moment)
– Near linear scale!!!

Noisy data
– Inconsistencies galore
– Publishing errors
– “Ontology hijacking”



Web Reasoning: Challenges

Challenges (Semantic Web Wikipedia article)

Some of the challenges for the Semantic Web include vastness, vagueness, uncertainty, inconsistency and deceit. Automated reasoning systems will have to deal with all of these issues in order to deliver on the promise of the Semantic Web.

Vastness: The World Wide Web contains at least 48 billion pages as of this writing (August 2, 2009). The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not yet been able to eliminate all semantically duplicated terms. Any automated reasoning system will have to deal with truly huge inputs.

Vagueness: These are imprecise concepts like "young" or "tall". This arises from the vagueness of user queries, of concepts represented by content providers, of matching query terms to provider terms and of trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is the most common technique for dealing with vagueness.

Uncertainty: These are precise concepts with uncertain values. For example, a patient might present a set of symptoms which correspond to a number of different distinct diagnoses each with a different probability. Probabilistic reasoning techniques are generally employed to address uncertainty.

Inconsistency: These are logical contradictions which will inevitably arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails catastrophically when faced with inconsistency, because "anything follows from a contradiction". Defeasible reasoning and paraconsistent reasoning are two techniques which can be employed to deal with inconsistency.

Deceit: This is when the producer of the information is intentionally misleading the consumer of the information. Cryptography techniques are currently utilized to ameliorate this threat.



Noisy Data: Omnipotent Being

Proposition 1 Web data is noisy.

Proof: 08445a31a78661b5c746feff39a9db6e4e2cc5cf

SHA-1 sum of ‘mailto:’: a common value for foaf:mbox_sha1sum

An inverse-functional (uniquely identifying) property!!! Any two people who share the same value will be considered the same person

Q.E.D.



More Proof:

From http://www.eiao.net/rdf/1.0:

<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">

<rdfs:label xml:lang="en">type</rdfs:label>

<rdfs:comment xml:lang="en">Type of resource</rdfs:comment>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>

</owl:Property>

Ontology hijacking!!

Noisy Data: Redefining Everything …and home in time for tea



The Web…

…forecast is for muck



(Briefly) Why use a rule-based approach? …as opposed to a Description Logics-based approach

Massive A-Box (i.e., instance data)

Inconsistencies galore

Publishing errors / Messy data

Popular Web ontologies are fairly inexpressive

Web Reasoning: Use Rules!



Forward Chaining materialisation:

Avoid runtime expense of backward-chaining
– Users taught impatience by Google

Pre-compute answers for quick retrieval

Web-scale systems should be scalable!
– More data = more disk space AND/OR more machines

Web Reasoning: Forward Chaining!



“Standard”: RDFS; OWL 2 RL (W3C Rec: 27 Oct. 2009)

“Non-standard”: DLP; pD* (OWL Horst); OWL⁻

OWL 2 RL: first standard OWL rule-expressible “fragment”!
– More inclusive than previous non-standard OWL rule fragments
– Includes RDFS rules
– Includes rule support for new OWL 2 constructs

…although I don’t know of any OWL 2 data on the Web

What rules?



Okay, so let’s do forward-chaining OWL 2 RL on billions of triples collected from the Web…

foaf:mbox_sha1sum a owl:InverseFunctionalProperty .

?x foaf:mbox_sha1sum 08445a31a78661b5c746feff39a9db6e4e2cc5cf .

OWL 2 RL rule prp-ifp: ?p a owl:InverseFunctionalProperty . ?x1 ?p ?z . ?x2 ?p ?z . ⇒ ?x1 owl:sameAs ?x2 .

10^6 ?x1/?x2 bindings in body

10^12 inferred pair-wise and reflexive owl:sameAs statements

…or in simpler terms:

pow!
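The quadratic blow-up is easy to reproduce at toy scale. The following sketch applies prp-ifp to a small synthetic dataset in which 100 made-up subjects (`ex:person0` … `ex:person99`, invented here) all share the ‘mailto:’ hash; the function name `prp_ifp` and the grouping strategy are illustrative assumptions, not the talk's implementation.

```python
from collections import defaultdict
from itertools import product


def prp_ifp(triples, inverse_functional):
    """Emit owl:sameAs for every pair of subjects sharing an IFP value."""
    groups = defaultdict(set)                 # (property, value) -> subjects
    for s, p, o in triples:
        if p in inverse_functional:
            groups[(p, o)].add(s)
    inferred = set()
    for subjects in groups.values():
        # Both body patterns range over the same group, so the result
        # includes reflexive pairs: n subjects give n * n statements.
        for x1, x2 in product(subjects, repeat=2):
            inferred.add((x1, "owl:sameAs", x2))
    return inferred


EMPTY_MBOX = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"
n = 100                                        # toy scale; the slide has 10^6
triples = [(f"ex:person{i}", "foaf:mbox_sha1sum", EMPTY_MBOX) for i in range(n)]
same_as = prp_ifp(triples, {"foaf:mbox_sha1sum"})
print(len(same_as))
```

With n subjects in one group the rule yields n² statements, so the slide's 10^6 bindings give (10^6)² = 10^12 inferences.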



Okay, so let’s do forward-chaining OWL 2 RL on billions of triples collected from the Web…

OWL 2 RL rule eq-ref: ?s ?p ?o . ⇒ ?s owl:sameAs ?s . ?p owl:sameAs ?p . ?o owl:sameAs ?o .

Adds |T| triples, where T is the set of RDF terms in the data

Could be easily supported by backward-chaining/query rewriting

Boring



SAOR: Scalable Authoritative OWL Reasoner

Goals:

Scalability
– Separate T-Box (schema) data
– Incomplete reasoning!
– Reduced output
– Incomplete reasoning!

Web tolerance
– Consider provenance of Web data
– Incomplete reasoning!



Scalable Reasoning: In-mem T-Box

Main optimisation: store T-Box in memory

T-Box: (loosely) data describing classes and properties

By far the most commonly accessed segment of data for reasoning

Quite small (1-2%)

e.g. from a 100M statement Web crawl:
– A-Box: 3,753,791 × ?s foaf:name ?o .
– vs. T-Box: <20 × foaf:name ?p ?o . + ?s ?p foaf:name .



Scan 1: Scan input data, separate out T-Box statements, load T-Box statements into memory.

Scan 2: Scan all on-disk data, join with in-memory T-Box.

Scalable Reasoning: Scans



[Figure: one A-Box statement at a time is joined with the in-memory T-Box]

ON-DISK A-BOX: … ex:me ex:presented ex:FridayTalk . …

IN-MEM T-BOX (terms shown): ex:presented, ex:presentedBy, owl:inverseOf, foaf:Person, rdfs:domain, foaf:Agent, rdfs:subClassOf

ON-DISK OUTPUT: … ex:FridayTalk ex:presentedBy ex:me . ex:me rdf:type foaf:Person . ex:me rdf:type foaf:Agent . …

Execution of three rules:

OWL 2 RL rule prp-inv1: ?p1 owl:inverseOf ?p2 . ?x ?p1 ?y . ⇒ ?y ?p2 ?x .

OWL 2 RL rule prp-dom: ?p rdfs:domain ?c . ?x ?p ?y . ⇒ ?x a ?c .

OWL 2 RL rule cax-sco: ?c1 rdfs:subClassOf ?c2 . ?x a ?c1 . ⇒ ?x a ?c2 .

Scalable Reasoning: No A-Box Joins
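The two scans and the three rules above can be sketched in Python as follows. This is an illustrative toy, not SAOR itself: treating a statement as T-Box purely by its predicate is a simplifying assumption, the T-Box triples in the example are guessed from the terms in the figure, and `statements` is an in-memory list where the real system streams from disk.

```python
from collections import defaultdict

# Simplifying assumption: a statement is T-Box if it uses one of these predicates.
SCHEMA_PREDICATES = {"owl:inverseOf", "rdfs:domain", "rdfs:subClassOf"}


def load_tbox(statements):
    """Scan 1: index T-Box statements as (schema predicate, subject) -> objects."""
    tbox = defaultdict(list)
    for s, p, o in statements:
        if p in SCHEMA_PREDICATES:
            tbox[(p, s)].append(o)
    return tbox


def infer(statement, tbox):
    """Inferences from ONE statement plus the T-Box; no A-Box joins."""
    s, p, o = statement
    out = [(o, p2, s) for p2 in tbox.get(("owl:inverseOf", p), [])]   # prp-inv1
    out += [(s, "rdf:type", c) for c in tbox.get(("rdfs:domain", p), [])]  # prp-dom
    if p == "rdf:type":                                               # cax-sco
        out += [(s, "rdf:type", c2) for c2 in tbox.get(("rdfs:subClassOf", o), [])]
    return out


def reason(statements):
    """Scan 2: stream statements, recursively applying rules to each one."""
    tbox = load_tbox(statements)
    for statement in statements:
        seen, queue = {statement}, [statement]
        while queue:
            for new in infer(queue.pop(), tbox):
                if new not in seen:
                    seen.add(new)
                    queue.append(new)
                    yield new


statements = [
    ("ex:presented", "owl:inverseOf", "ex:presentedBy"),   # assumed T-Box
    ("ex:presented", "rdfs:domain", "foaf:Person"),        # assumed T-Box
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),      # assumed T-Box
    ("ex:me", "ex:presented", "ex:FridayTalk"),            # A-Box
]
output = set(reason(statements))
```

Because each A-Box statement is processed on its own, memory use is bounded by the size of the T-Box rather than the size of the data.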



However: some rules do require A-Box joins

?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z . ⇒ ?x ?p ?z .

Difficult to engineer a scalable solution (which reaches a fixpoint)

No A-Box joins for SAOR reasoning over 1B statements

~99% of inferences over Web data possible without A-Box joins

48/76 OWL 2 RL rules don’t require A-Box joins

Side note: No RDFS rule requires A-Box joins

And these rules cover most of what current Web ontologies use

Scalable Reasoning: A-Box joins?
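To see why the transitivity rule is harder, note that both of its instance patterns match A-Box data and that its output feeds back into its own body. A minimal in-memory sketch (function name and data invented for illustration) has to index the whole relation and iterate to a fixpoint:

```python
from collections import defaultdict


def transitive_closure(edges):
    """Fixpoint of: (x, y) and (y, z) => (x, z), for one transitive property."""
    closure = set(edges)
    frontier = set(edges)                  # pairs added in the previous round
    while frontier:
        successors = defaultdict(set)      # A-Box index: x -> {y}
        predecessors = defaultdict(set)    # A-Box index: y -> {x}
        for x, y in closure:
            successors[x].add(y)
            predecessors[y].add(x)
        new = set()
        for x, y in frontier:
            for z in successors[y]:        # join on the object side
                new.add((x, z))
            for w in predecessors[x]:      # join on the subject side
                new.add((w, y))
        frontier = new - closure
        closure |= frontier
    return closure


chain = {("a", "b"), ("b", "c"), ("c", "d")}
closed = transitive_closure(chain)
```

Unlike the T-Box-only rules, this needs the instance data itself to be indexed and revisited, which is what makes a scalable on-disk solution difficult.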



T-Box only!

Document D authoritative for class/property X iff:
– X is a blank node
– OR de-referenced URI of X coincides with or redirects to D

FOAF spec authoritative for foaf:Person ✓
MY spec not authoritative for foaf:Person ✘

Only allow extension in authoritative documents:
my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships:
foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications:
foaf:mbox rdf:type owl:SymmetricProperty . (MY spec) ✘

Similarly for other T-Box statements.

In-memory T-Box stores authoritative information for rule execution

Greatly reduces output size!!!

Compatible with FOAF, SIOC, DC (common Web vocabulary etiquette)

Authoritative Reasoning
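The authority check can be sketched as a small predicate. Everything concrete here is an assumption made for illustration: the `redirects` map stands in for redirects recorded during the crawl (no HTTP request is made), the FOAF redirect target and the `http://example.org/my` document are made-up examples, and checking only the subject of a T-Box statement is a simplification that happens to match the three examples on the slide.

```python
from urllib.parse import urldefrag


def is_authoritative(document, term, redirects):
    """True if `document` may speak authoritatively about `term`."""
    if term.startswith("_:"):                    # blank node
        return True
    target, _fragment = urldefrag(term)          # de-referencing drops the fragment
    target = redirects.get(target, target)       # follow a recorded redirect, if any
    return target == document


def accept(statement, document, redirects):
    """Simplified filter: keep a T-Box statement only if its source
    document is authoritative for the statement's subject."""
    subject, _predicate, _object = statement
    return is_authoritative(document, subject, redirects)


FOAF = "http://xmlns.com/foaf/0.1/"
FOAF_SPEC = "http://xmlns.com/foaf/spec/"        # assumed redirect target
MY_SPEC = "http://example.org/my"                # hypothetical document
redirects = {FOAF + "Person": FOAF_SPEC, FOAF + "mbox": FOAF_SPEC}

extension = (MY_SPEC + "#Person", "rdfs:subClassOf", FOAF + "Person")
hijack = (FOAF + "Person", "rdfs:subClassOf", MY_SPEC + "#Person")
print(accept(extension, MY_SPEC, redirects), accept(hijack, MY_SPEC, redirects))
```

Only statements that pass such a filter would be loaded into the in-memory T-Box, so third-party redefinitions of popular terms never reach rule execution.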



More Proof:

From http://www.eiao.net/rdf/1.0:

<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">

<rdfs:label xml:lang="en">type</rdfs:label>

<rdfs:comment xml:lang="en">Type of resource</rdfs:comment>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>

</owl:Property>

Ontology hijacking!!

Noisy Data: Redefining Everything …revisited

Not authoritative!!!!



Distributed Reasoning

More recently performed reasoning over a cluster of commodity hardware

Ran the “easy” OWL 2 RL rules (no A-Box joins)

Duplicate the T-Box to all machines… A-Box can be arbitrarily distributed…

Authoritative (of course)

Eight machines, 4GB main memory, 2.2 GHz

1.192b input statements crawled last month, pre-distributed over the machines

Reasoning in 113 minutes
– Extract T-Box: 16 mins
– Aggregate, perform authoritative analysis, broadcast T-Box: 14 mins
– Reasoning over A-Box: 83 mins

Output 570m inferences
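Why an arbitrary A-Box distribution is safe for these rules can be checked on a toy example. The sketch below uses only the subclass rule, a round-robin split, and invented data (`ex:Student` and the `ex:a` … `ex:d` individuals are hypothetical); it illustrates the idea and says nothing about how the actual cluster code was written.

```python
def reason_partition(abox, sub_class_of):
    """Apply cax-sco to each statement independently (no A-Box joins)."""
    inferred = set()
    for s, p, o in abox:
        if p != "rdf:type":
            continue
        seen, stack = {o}, [o]
        while stack:                       # walk up the class hierarchy
            for sup in sub_class_of.get(stack.pop(), ()):
                if sup not in seen:
                    seen.add(sup)
                    stack.append(sup)
                    inferred.add((s, "rdf:type", sup))
    return inferred


def split(abox, n_machines):
    """Arbitrary (here round-robin) distribution of A-Box statements."""
    return [abox[i::n_machines] for i in range(n_machines)]


# The same T-Box is handed to every "machine".
sub_class_of = {"ex:Student": {"foaf:Person"}, "foaf:Person": {"foaf:Agent"}}
abox = [
    ("ex:a", "rdf:type", "ex:Student"),
    ("ex:b", "rdf:type", "foaf:Person"),
    ("ex:c", "foaf:name", "C"),
    ("ex:d", "rdf:type", "foaf:Agent"),
]
centralised = reason_partition(abox, sub_class_of)
distributed = set().union(
    *(reason_partition(part, sub_class_of) for part in split(abox, 3))
)
```

Since no rule needs two A-Box statements at once, the union of the per-machine outputs equals the centralised output, and machines never need to exchange instance data.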



ex:RR2009Talk ex:presentedBy ex:Aidan .

ex:presentedBy owl:inverseOf ex:presented .

…and back again



[Figure: dataset split into IMPLICIT and EXPLICIT data]



deri:Aidan ex:presented deri:FridayTalk .

…but what about…



[Figure: dataset split into IMPLICIT and EXPLICIT data]



ex:Aidan ex:presented ex:FridayTalk .

…and…



[Figure: dataset split into IMPLICIT and EXPLICIT data]


Equality Reasoning: standard (e.g. OWL 2 RL) rules for OWL equality not great:
– Saw noisy data earlier
– Quadratic explosion of inferences for equivalences, with high duplication
– Do not solve “synonymous duplicates” query answering problem

Entity Consolidation: instead use “canonical” identifiers: one term to represent a set of equivalent individuals

Equality Reasoning/Entity Consolidation
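One way to picture consolidation is a union-find structure over owl:sameAs pairs, followed by a rewrite of the data. This is a hedged sketch rather than the method used in the talk: picking the lexicographically lowest term as the canonical identifier, dropping the owl:sameAs statements from the output, and the `deri:FridayTalk owl:sameAs ex:FridayTalk` statement are all assumptions made for the example.

```python
def canonical_ids(same_as_pairs):
    """Map every term in an equivalence class to one canonical term."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            low, high = sorted((root_a, root_b))
            parent[high] = low                 # lowest term stays the root
    return {term: find(term) for term in list(parent)}


def consolidate(triples, canon):
    """Rewrite subjects and objects to canonical terms; drop owl:sameAs."""
    return {
        (canon.get(s, s), p, canon.get(o, o))
        for s, p, o in triples
        if p != "owl:sameAs"
    }


triples = [
    ("deri:Aidan", "ex:presented", "deri:FridayTalk"),
    ("ex:Aidan", "ex:presented", "ex:FridayTalk"),
    ("ex:Aidan", "ex:presented", "ex:RR2009Talk"),
    ("deri:Aidan", "owl:sameAs", "ex:Aidan"),
    ("deri:FridayTalk", "owl:sameAs", "ex:FridayTalk"),   # assumed for the example
]
pairs = [(s, o) for s, p, o in triples if p == "owl:sameAs"]
consolidated = consolidate(triples, canonical_ids(pairs))
```

The output grows no larger than the input, and a query for the canonical term returns each talk once, which addresses both the omission and the duplicate problems from the opening slides.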



Need owl:sameAs relations

Explicit owl:sameAs statements: good precision / poor recall

Inverse-functional properties: reasonable recall / poor precision
– Publishers not aware of inverse-functional semantics of such properties

Other rules: very poor recall / ? high precision

Been there, done that…

Most recently, consolidation using explicit owl:sameAs statements
– 8 machines (as before)
– 61 mins for 1.193bn statements



Probabilistic/statistical approach to equality reasoning

Identify resources with high probability of being equivalent:
– use semantics in the data (equality reasoning);
– use statistics derived from the data; e.g.:
  – Two people have same birthday, name and share co-authors…

Identify resources with high probability of being different:
– use semantics in the data (inconsistencies?);
– again use statistics derived from the data; e.g.:
  – Two people have different dates-of-birth and names

Perform “fuzzy” reasoning

Leverage links-based analysis from input data to give inferences “scores” of trustworthiness

Depending on results, could be used to, e.g.:
– identify “interesting” inferences for partial-materialisation;
– identify “trustworthy” inferences for bypassing noise in Web data.

Must be domain-agnostic/scalable/distributable/give good results for noisy and heterogeneous Web data

Future Work



1. Web data is messy

2. Reasoning over Web data is difficult

3. Need incomplete, albeit inclusive reasoning

4. Rule execution optimisations possible through special treatment of terminological data

5. Need to consider the provenance of data

6. OWL 2 RL not immediately suitable to application over Web data

7. Incomplete OWL 2 RL support can be offered using existing technologies, in a scalable and tolerant way

8. Busy year ahead

Conclusion
