Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering

Copyright 2009 Digital Enterprise Research Institute. All rights reserved. Digital Enterprise Research Institute, www.deri.ie. Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering. Mini-Viva, Aidan Hogan, 12th February, 2010.

description

Given at DERI Galway on 2010/02/12. More and more data is being published on the Web through RDF; in particular – and largely under the auspices of the pragmatic Linked Data community – more and more structured data is being published within a plethora of different domains: e.g., information is being published from Wikipedia, BBC, LastFM, the UK government, etc., describing people, organisations, online communities, movies, proteins, and so forth. In this talk, I will present research I have carried out during my PhD which aims at enhancing large-scale RDF web datasets for query-answering: I will show some simple examples of problems with query-answering over the “native” RDF data, and discuss pragmatic solutions – in light of these examples – which involve using a scalable and best-effort reasoning approach. I will also discuss open questions and future directions along the lines of the above topic.

Transcript of Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering

Page 1: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering

Copyright 2009 Digital Enterprise Research Institute. All rights reserved.

Digital Enterprise Research Institute www.deri.ie

Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering

Mini-Viva

Aidan Hogan

12th February, 2010

Page 2: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Overview

Fig 1: RDF Web Dataset

explicit data

implicit data

Topic of today’s talk: How to exploit implicit data for “query answering”

Page 3: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Query Answering…

…over RDF Web data… (Linked Data if you prefer)

Search engines such as SWSE, Sindice, Falcons, Swoogle, Watson etc.

SPARQL endpoints over Web data such as YARS2, Virtuoso, etc.

Page 4: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:Aidan ex:presented ex:RR2009Talk .

deri:Aidan ex:presented deri:FridayTalk .

Problem: Synonymous Omissions

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 5: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:Aidan ex:presented deri:FridayTalk .

ex:Aidan ex:presented ex:FridayTalk .

Problem: Synonymous Duplicates

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 6: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:RR2009Talk ex:presentedBy ex:Aidan .

Problem: Incomplete Answers

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 7: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Solution: Publish Complete Data?

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

ex:RR2009Talk ex:presentedBy ex:Aidan .

ex:Aidan ex:presented ex:RR2009Talk .

Page 8: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:RR2009Talk ex:presentedBy ex:Aidan .

Solution: Write Query in many ways?

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .
?talk ex:presentedBy ex:Aidan .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 9: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


deri:Aidan ex:presented deri:FridayTalk .
deri:Aidan owl:sameAs ex:Aidan .
ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:presentedBy owl:inverseOf ex:presented .

ex:Aidan ex:presented deri:FridayTalk .

ex:Aidan ex:presented ex:RR2009Talk .

Solution: Exploit OWL and RDFS…

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 10: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


OWL / RDFS

(loosely) Define the semantics of classes and properties… (define relationships between terms)

ex:presentedBy owl:inverseOf ex:presented .

ex:presentedBy rdfs:domain ex:Talk .

ex:presentedBy rdfs:range ex:Person .

Define equivalence between individuals (owl:sameAs)

ex:Aidan owl:sameAs deri:Aidan .

Give machines an insight into the meaning of data

Allows for reasoning

Page 11: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Reasoning

(Loosely) Use the semantics of classes and properties—defined in RDFS and OWL—to make implicit knowledge explicit

One approach is using rules: IF condition THEN consequent

?p1 owl:inverseOf ?p2 . ?s ?p1 ?o . => ?o ?p2 ?s .

ex:presentedBy owl:inverseOf ex:presented .
ex:FridayTalk ex:presentedBy ex:Aidan .
=> ex:Aidan ex:presented ex:FridayTalk .

?p rdfs:domain ?c . ?s ?p ?o . => ?s rdf:type ?c .

?p rdfs:range ?c . ?s ?p ?o . => ?o rdf:type ?c .

ex:presentedBy rdfs:domain ex:Talk .
ex:FridayTalk ex:presentedBy ex:Aidan .
=> ex:FridayTalk rdf:type ex:Talk .

ex:presentedBy rdfs:range ex:Person .
ex:FridayTalk ex:presentedBy ex:Aidan .
=> ex:Aidan rdf:type ex:Person .
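The three rules on this slide can be sketched as a naive forward-chaining loop. This is only an illustration, not the reasoner discussed later in the talk: it assumes triples are plain Python tuples of prefixed names, and the function names `apply_rules` and `materialise` are made up for the example.

```python
# Triples are (subject, predicate, object) tuples of prefixed names.
INVERSE_OF = "owl:inverseOf"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
TYPE = "rdf:type"


def apply_rules(triples):
    """Return everything one pass of the three slide rules can infer."""
    inferred = set()
    for (p1, rel, c) in triples:
        if rel not in (INVERSE_OF, DOMAIN, RANGE):
            continue  # not a schema statement used by these rules
        for (s, p, o) in triples:
            if p != p1:
                continue
            if rel == INVERSE_OF:
                inferred.add((o, c, s))      # ?o ?p2 ?s
            elif rel == DOMAIN:
                inferred.add((s, TYPE, c))   # ?s rdf:type ?c
            else:
                inferred.add((o, TYPE, c))   # ?o rdf:type ?c
    return inferred


def materialise(triples):
    """Repeat the rules until no new triple appears (a fixpoint)."""
    closure = set(triples)
    while True:
        new = apply_rules(closure) - closure
        if not new:
            return closure
        closure |= new


data = {
    ("ex:presentedBy", "owl:inverseOf", "ex:presented"),
    ("ex:presentedBy", "rdfs:domain", "ex:Talk"),
    ("ex:presentedBy", "rdfs:range", "ex:Person"),
    ("ex:FridayTalk", "ex:presentedBy", "ex:Aidan"),
}
closure = materialise(data)
for triple in sorted(closure - data):
    print(triple)
```

The nested loop is quadratic in the number of triples, which is fine for four statements and hopeless for a Web crawl; that gap is what the rest of the talk is about.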

Page 12: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


deri:Aidan ex:presented deri:FridayTalk .
deri:Aidan owl:sameAs ex:Aidan .
ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:presentedBy owl:inverseOf ex:presented .

Reasoning: Make Explicit the Implicit

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

ex:Aidan ex:presented deri:FridayTalk .
ex:Aidan ex:presented ex:RR2009Talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 13: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Web Reasoning: Challenges

Scalability
– Billions or tens of billions of statements (for the moment)
– Near linear scale!!!

Noisy data
– Inconsistencies galore
– Publishing errors
– “Ontology hijacking”

Page 14: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Web Reasoning: Challenges

Challenges (Semantic Web Wikipedia Article)

Some of the challenges for the Semantic Web include vastness, vagueness, uncertainty, inconsistency and deceit. Automated reasoning systems will have to deal with all of these issues in order to deliver on the promise of the Semantic Web.

Vastness: The World Wide Web contains at least 48 billion pages as of this writing (August 2, 2009). The SNOMED CT medical terminology ontology contains 370,000 class names, and existing technology has not yet been able to eliminate all semantically duplicated terms. Any automated reasoning system will have to deal with truly huge inputs.

Vagueness: These are imprecise concepts like "young" or "tall". This arises from the vagueness of user queries, of concepts represented by content providers, of matching query terms to provider terms and of trying to combine different knowledge bases with overlapping but subtly different concepts. Fuzzy logic is the most common technique for dealing with vagueness.

Uncertainty: These are precise concepts with uncertain values. For example, a patient might present a set of symptoms which correspond to a number of different distinct diagnoses each with a different probability. Probabilistic reasoning techniques are generally employed to address uncertainty.

Inconsistency: These are logical contradictions which will inevitably arise during the development of large ontologies, and when ontologies from separate sources are combined. Deductive reasoning fails catastrophically when faced with inconsistency, because "anything follows from a contradiction". Defeasible reasoning and paraconsistent reasoning are two techniques which can be employed to deal with inconsistency.

Deceit: This is when the producer of the information is intentionally misleading the consumer of the information. Cryptography techniques are currently utilized to ameliorate this threat.

Page 15: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Noisy Data: Omnipotent Being

Proposition 1: Web data is noisy.

Proof: 08445a31a78661b5c746feff39a9db6e4e2cc5cf

– sha1-sum of ‘mailto:’
– common value for foaf:mbox_sha1sum
– an inverse-functional (uniquely identifying) property!!!
– Any person who shares the same value will be considered the same

Q.E.D.
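A reader who wants to check the proof can hash the bare string themselves. The sketch below assumes the usual FOAF convention of hashing the full mailto: URI with SHA-1; the helper name `mbox_sha1sum` and the example address are made up for illustration. Per the slide, the first line printed should be the value quoted above.

```python
import hashlib


def mbox_sha1sum(mailto_uri: str) -> str:
    """SHA-1 hex digest of a mailto: URI, as used for foaf:mbox_sha1sum."""
    return hashlib.sha1(mailto_uri.encode("utf-8")).hexdigest()


# An exporter that hashes an empty e-mail field ends up hashing just "mailto:".
empty_mbox = mbox_sha1sum("mailto:")
print(empty_mbox)

# Any real address gives a different digest.
print(mbox_sha1sum("mailto:someone@example.org"))
```

Every profile exported with an empty e-mail field carries the same digest, so an inverse-functional reading of the property merges all of those people into one.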

Page 16: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


More Proof:

From http://www.eiao.net/rdf/1.0:

<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">

<rdfs:label xml:lang="en">type</rdfs:label>

<rdfs:comment xml:lang="en">Type of resource</rdfs:comment>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>

</owl:Property>

Ontology hijacking!!

Noisy Data: Redefining Everything …and home in time for tea

Page 17: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


The Web…

…forecast is for muck

Page 18: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


(Briefly) Why use a rule-based approach? …as opposed to a Description Logics based approach

Massive A-Box (i.e., instance data)

Inconsistencies galore

Publishing errors / Messy data

Popular Web ontologies are fairly inexpressive

Web Reasoning: Use Rules!

Page 19: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Forward-chaining materialisation:

Avoid runtime expense of backward-chaining
– Users taught impatience by Google

Pre-compute answers for quick retrieval

Web-scale systems should be scalable!
– More data = more disk space AND/OR more machines

Web Reasoning: Forward Chaining!

Page 20: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


“Standard”
– RDFS
– OWL 2 RL (W3C Rec: 27 Oct. 2009)

“Non-standard”
– DLP
– pD* (OWL Horst)
– OWL–

OWL 2 RL: first standard OWL rule-expressible “fragment”!
– More inclusive than previous non-standard OWL rule fragments
– Includes RDFS rules
– Includes rule support for new OWL 2 constructs
– …although I don’t know of any OWL 2 data on the Web

What rules?

Page 21: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Okay, so let’s do forward-chaining OWL 2 RL on billions of triples collected from the Web…

foaf:mbox_sha1sum a owl:InverseFunctionalProperty .

?x foaf:mbox_sha1sum 08445a31a78661b5c746feff39a9db6e4e2cc5cf .

OWL 2 RL rule prp-ifp: ?p a owl:InverseFunctionalProperty . ?x1 ?p ?z . ?x2 ?p ?z . ⇒ ?x1 owl:sameAs ?x2 .

10^6 ?x1/?x2 bindings in body

10^12 inferred pair-wise and reflexive owl:sameAs statements

…or in simpler terms:

pow!
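The explosion is easy to reproduce at toy scale. The sketch below is a simplified stand-in for rule prp-ifp, assuming in-memory tuples and an invented helper name `prp_ifp`; it uses 200 made-up subjects rather than a million so that it runs instantly.

```python
from itertools import product


def prp_ifp(triples, inverse_functional):
    """Pair up every two subjects sharing a value for an inverse-functional property."""
    groups = {}
    for s, p, o in triples:
        if p in inverse_functional:
            groups.setdefault((p, o), []).append(s)
    same_as = set()
    for members in groups.values():
        # Pair-wise AND reflexive, exactly as the rule body allows.
        for x1, x2 in product(members, repeat=2):
            same_as.add((x1, "owl:sameAs", x2))
    return same_as


n = 200  # the slide's real-world figure is about 10^6
noisy_value = "08445a31a78661b5c746feff39a9db6e4e2cc5cf"
triples = [(f"ex:person{i}", "foaf:mbox_sha1sum", noisy_value) for i in range(n)]

inferred = prp_ifp(triples, {"foaf:mbox_sha1sum"})
print(n, "subjects ->", len(inferred), "owl:sameAs statements")
```

The output grows with the square of the group size, so one shared noisy value among a million subjects is enough to swamp the whole materialisation.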

Page 22: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Okay, so let’s do forward-chaining OWL 2 RL on billions of triples collected from the Web…

OWL 2 RL rule eq-ref: ?s ?p ?o . ⇒ ?s owl:sameAs ?s . ?p owl:sameAs ?p . ?o owl:sameAs ?o .

– Adds |T| triples, where T is the set of RDF terms in the data
– Could be easily supported by backward-chaining/query rewriting
– Boring

Page 23: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


SAOR: Scalable Authoritative OWL Reasoner

Goals:

Scalability
– Separate T-Box (schema) data: incomplete reasoning!
– Reduced output: incomplete reasoning!

Web tolerance
– Consider provenance of Web data: incomplete reasoning!

Page 24: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Scalable Reasoning: In-mem T-Box

Main optimisation: store T-Box in memory
– T-Box: (loosely) data describing classes and properties
– By far, the most commonly accessed segment of data for reasoning
– Quite small (1-2%)

e.g. from a 100M statement Web crawl:
– A-Box: 3,753,791 × ?s foaf:name ?o .
– vs. T-Box: <20 × foaf:name ?p ?o . + ?s ?p foaf:name .

Page 25: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Scan 1: Scan input data, separate out the T-Box statements, and load them into memory.

Scan 2: Scan all on-disk data, join with in-memory T-Box.

Scalable Reasoning: Scans

Page 26: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


[Figure: statements are read from the on-disk A-Box, joined with the in-memory T-Box, and the inferences are written to the on-disk output.]

IN-MEM T-BOX (terms shown): ex:presented, ex:presentedBy, owl:inverseOf, foaf:Person, rdfs:domain, foaf:Agent, rdfs:subClassOf

ON-DISK A-BOX:
... ex:me ex:presented ex:FridayTalk . ...

ON-DISK OUTPUT:
... ex:FridayTalk ex:presentedBy ex:me .
ex:me rdf:type foaf:Person .
ex:me rdf:type foaf:Agent . ...

Execution of three rules:

OWL 2 RL rule prp-inv1: ?p1 owl:inverseOf ?p2 . ?x ?p1 ?y . ⇒ ?y ?p2 ?x .

OWL 2 RL rule prp-dom: ?p rdfs:domain ?c . ?x ?p ?y . ⇒ ?x a ?c .

OWL 2 RL rule cax-sco: ?c1 rdfs:subClassOf ?c2 . ?x a ?c1 . ⇒ ?x a ?c2 .

Scalable Reasoning: No A-Box Joins
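The two scans and the three rules above can be sketched as follows. This is a toy illustration, not SAOR itself: the three T-Box statements are an assumption chosen to be consistent with the output shown on the slide, the data lives in a Python list rather than on disk, and the function names are invented.

```python
from collections import defaultdict

TBOX_PREDICATES = {"owl:inverseOf", "rdfs:domain", "rdfs:subClassOf"}


def load_tbox(statements):
    """Scan 1: index T-Box statements in memory by (predicate, subject)."""
    tbox = defaultdict(set)
    for s, p, o in statements:
        if p in TBOX_PREDICATES:
            tbox[(p, s)].add(o)
    return tbox


def infer(statement, tbox):
    """Apply prp-inv1, prp-dom and cax-sco to ONE A-Box statement."""
    s, p, o = statement
    out = set()
    for p2 in tbox.get(("owl:inverseOf", p), ()):
        out.add((o, p2, s))
    for c in tbox.get(("rdfs:domain", p), ()):
        out.add((s, "rdf:type", c))
    if p == "rdf:type":
        for c2 in tbox.get(("rdfs:subClassOf", o), ()):
            out.add((s, "rdf:type", c2))
    return out


def reason(statements, tbox):
    """Scan 2: stream statements; each is joined only with the in-memory T-Box."""
    seen = set()  # de-duplication for this sketch only; a real system would not hold output in memory
    for statement in statements:
        queue = [statement]
        while queue:
            for new in infer(queue.pop(), tbox):
                if new not in seen:
                    seen.add(new)
                    queue.append(new)  # inferences are re-fed, e.g. for cax-sco
                    yield new


data = [
    # Assumed T-Box, reconstructed from the terms in the figure.
    ("ex:presented", "owl:inverseOf", "ex:presentedBy"),
    ("ex:presented", "rdfs:domain", "foaf:Person"),
    ("foaf:Person", "rdfs:subClassOf", "foaf:Agent"),
    # A-Box statement from the slide.
    ("ex:me", "ex:presented", "ex:FridayTalk"),
]
tbox = load_tbox(data)
output = set(reason(data, tbox))
for triple in sorted(output):
    print(triple)
```

No statement is ever compared against another A-Box statement: every lookup goes to the small in-memory index, which is why a single sequential pass over the on-disk data suffices for these rules.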

Page 27: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


However: some rules do require A-Box joins, e.g.:
?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z . ⇒ ?x ?p ?z .

– Difficult to engineer a scalable solution (which reaches a fixpoint)
– No A-Box joins for SAOR reasoning over 1B statements
– ~99% of inferences over Web data possible without A-Box joins

48/76 OWL 2 RL rules don’t require A-Box joins
– Side note: No RDFS rule requires A-Box joins
– And rules cover ~most of what current Web ontologies use

Scalable Reasoning: A-Box joins?
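To see why the transitivity rule is harder, here is a naive fixpoint over the instance statements of one transitive property. It is a sketch with made-up data and an invented function name, and it holds everything in memory, which is exactly what does not scale.

```python
def transitive_closure(edges):
    """Naive fixpoint for a transitive property: join A-Box statements with each other."""
    closure = set(edges)
    while True:
        new = {
            (x, z)
            for (x, y1) in closure
            for (y2, z) in closure
            if y1 == y2          # the A-Box join: ?x ?p ?y . ?y ?p ?z .
        } - closure
        if not new:
            return closure       # fixpoint reached
        closure |= new           # inferences must be joined again next round


# (subject, object) pairs for a single, hypothetical transitive property.
chain = {("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:c", "ex:d"), ("ex:d", "ex:e")}
closure = transitive_closure(chain)
print(len(chain), "statements in,", len(closure), "statements in the closure")
```

Unlike the T-Box lookups, each round has to revisit statements inferred in earlier rounds, so a single streaming scan over on-disk data is no longer enough.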

Page 28: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


T-Box only!

Document D authoritative for class/property X iff:
– X is a blank-node, OR
– De-referenced URI of X coincides with or redirects to D

FOAF spec authoritative for foaf:Person ✓
MY spec not authoritative for foaf:Person ✘

Only allow extension in authoritative documents:
my:Person rdfs:subClassOf foaf:Person . (MY spec) ✓

BUT: Reduce obscure memberships:
foaf:Person rdfs:subClassOf my:Person . (MY spec) ✘

ALSO: Protect specifications:
foaf:mbox rdf:type owl:SymmetricProperty . (MY spec) ✘

Similarly for other T-Box statements.

In-memory T-Box stores authoritative information for rule execution

Greatly reduces output size!!!

Compatible with FOAF, SIOC, DC (common Web vocabulary etiquette)

Authoritative Reasoning
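The authority check can be sketched in a few lines. This is a simplification under stated assumptions: redirects come from a hard-coded, illustrative table instead of real HTTP lookups, the MY spec URL is hypothetical, and a T-Box statement is accepted purely on whether the serving document is authoritative for its subject, which is what the three examples on the slide need.

```python
# Illustrative redirect table (assumption): term URI -> document it redirects to.
REDIRECTS = {
    "http://xmlns.com/foaf/0.1/Person": "http://xmlns.com/foaf/spec/",
    "http://xmlns.com/foaf/0.1/mbox": "http://xmlns.com/foaf/spec/",
}
FOAF_SPEC = "http://xmlns.com/foaf/spec/"
MY_SPEC = "http://example.org/my"          # hypothetical document
MY_PERSON = "http://example.org/my#Person"
FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_MBOX = "http://xmlns.com/foaf/0.1/mbox"


def is_authoritative(document, term, is_blank_node=False):
    """D is authoritative for X iff X is a blank node, or the
    de-referenced URI of X coincides with or redirects to D."""
    if is_blank_node:
        return True
    dereferenced = term.split("#", 1)[0]   # drop the fragment identifier
    return document == REDIRECTS.get(dereferenced, dereferenced)


def accept_tbox_statement(statement, document):
    """Simplified filter: keep a T-Box statement only if the document
    serving it is authoritative for the statement's subject."""
    subject, _predicate, _object = statement
    return is_authoritative(document, subject)


print(accept_tbox_statement((MY_PERSON, "rdfs:subClassOf", FOAF_PERSON), MY_SPEC))
print(accept_tbox_statement((FOAF_PERSON, "rdfs:subClassOf", MY_PERSON), MY_SPEC))
print(accept_tbox_statement((FOAF_MBOX, "rdf:type", "owl:SymmetricProperty"), MY_SPEC))
```

Under these assumptions the first statement is kept and the other two are dropped, matching the ✓ and ✘ marks above.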

Page 29: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


More Proof:

From http://www.eiao.net/rdf/1.0:

<owl:Property rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">

<rdfs:label xml:lang="en">type</rdfs:label>

<rdfs:comment xml:lang="en">Type of resource</rdfs:comment>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#testRun"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#pageSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#siteSurvey"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#scenario"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#rangeLocation"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#startPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#endPointer"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#header"/>

<rdfs:domain rdf:resource="http://www.eiao.net/rdf/1.0#runs"/>

</owl:Property>

Ontology hijacking!!

Noisy Data: Redefining Everything …revisited

Not authoritative!!!!

Page 30: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Distributed Reasoning

More recently performed reasoning over a cluster of commodity hardware:
– Ran the “easy” OWL 2 RL rules (no A-Box joins)
– Duplicate the T-Box to all machines…
– A-Box can be arbitrarily distributed…
– Authoritative (of course)
– Eight machines, 4GB main memory, 2.2 GHz
– 1.192b input statements crawled last month, pre-distributed over the machines
– Reasoning in 113 minutes:
  – Extract T-Box: 16 mins
  – Aggregate, perform authoritative analysis, broadcast T-Box: 14 mins
  – Reasoning over A-Box: 83 mins
– Output 570m inferences

Page 31: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:RR2009Talk ex:presentedBy ex:Aidan .
ex:presentedBy owl:inverseOf ex:presented .

…and back again

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

ex:Aidan ex:presented ex:RR2009Talk .

Page 32: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:Aidan ex:presented ex:RR2009Talk .

deri:Aidan ex:presented deri:FridayTalk .

…but what about…

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 33: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


ex:Aidan ex:presented deri:FridayTalk .

ex:Aidan ex:presented ex:FridayTalk .

…and…

Query: Give me all talks presented by Aidan

ex:Aidan ex:presented ?talk .

[Figure: dataset partitioned into IMPLICIT and EXPLICIT data]

Page 34: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Equality Reasoning: standard (e.g. OWL 2 RL) rules for OWL equality not great:
– Saw noisy data earlier
– Quadratic explosion of inferences for equivalences, with high duplication
– Do not solve “synonymous duplicates” query answering problem

Entity Consolidation:
– Instead use “canonical” identifiers: one term to represent set of equivalent individuals

Equality Reasoning/Entity Consolidation
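Canonical identifiers can be sketched with a union-find structure over the owl:sameAs statements. The choices below are assumptions made for the example, not the approach used in the work presented: the lexicographically smallest term is picked as the canonical one, only subjects and objects are rewritten, and the owl:sameAs link between the two FridayTalk identifiers is added to the running example for illustration.

```python
def consolidate(triples):
    """Rewrite subjects and objects to one canonical term per owl:sameAs class."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Assumption: the lexicographically smaller root becomes canonical.
            keep, drop = sorted((root_a, root_b))
            parent[drop] = keep

    for s, p, o in triples:
        if p == "owl:sameAs":
            union(s, o)

    # One statement per canonical subject/object pair; sameAs links are dropped.
    return {(find(s), p, find(o)) for s, p, o in triples if p != "owl:sameAs"}


data = [
    ("deri:Aidan", "owl:sameAs", "ex:Aidan"),
    ("deri:FridayTalk", "owl:sameAs", "ex:FridayTalk"),  # assumed for illustration
    ("ex:Aidan", "ex:presented", "deri:FridayTalk"),
    ("ex:Aidan", "ex:presented", "ex:FridayTalk"),
    ("deri:Aidan", "ex:presented", "ex:RR2009Talk"),
]
consolidated = consolidate(data)
for triple in sorted(consolidated):
    print(triple)
```

The output contains one statement per distinct talk, so the “synonymous duplicates” collapse, and no pair-wise owl:sameAs statements are materialised; queries then need their own terms rewritten to the canonical identifiers.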

Page 35: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Need owl:sameAs relations
– Explicit owl:sameAs statements: good precision / poor recall
– Inverse-functional properties: reasonable recall / poor precision
  – Publishers not aware of inverse-functional semantics of such properties
– Other rules: very poor recall / ? high precision

Been there, done that…

Most recently, consolidation using explicit owl:sameAs statements
– 8 machines (as before)
– 61 mins for 1.193bn statements

Page 36: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


Probabilistic/statistical approach to equality reasoning

Identify resources with high probability of being equivalent:
– use semantics in the data (equality reasoning);
– use statistics derived from the data; e.g.:
  – Two people have same birthday, name and share co-authors…

Identify resources with high probability of being different:
– use semantics in the data (inconsistencies?);
– again use statistics derived from the data; e.g.:
  – Two people have different dates-of-birth and names

Perform “fuzzy” reasoning
– Leverage links-based analysis from input-data to give inferences “scores” of trustworthiness
– Depending on results, could be used to, e.g.:
  – identify “interesting” inferences for partial-materialisation;
  – identify “trustworthy” inferences for bypassing noise in Web data.

Must be domain-agnostic/scalable/distributable/give good results for noisy and heterogeneous Web data

Future Work

Page 37: Enhancing Large-Scale RDF Web Knowledge-bases for Query Answering


1. Web data is messy

2. Reasoning over Web data is difficult

3. Need incomplete, albeit inclusive reasoning

4. Rule execution optimisations possible through special treatment of terminological data

5. Need to consider the provenance of data

6. OWL 2 RL not immediately suitable to application over Web data

7. Incomplete OWL 2 RL support can be offered using existing technologies, in a scalable and tolerant way

8. Busy year ahead

Conclusion