Crowdsourcing-enabled Linked Data management architecture

Crowdsourcing with and for Linked Data at the CrowdSearch workshop @WWW2012

Page 1: Crowdsourcing-enabled Linked Data management architecture

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu

Institute of Applied Informatics and Formal Description Methods (AIFB)

A semantically enabled architecture for crowdsourced Linked Data management

Elena Simperl (1), Maribel Acosta (1), Barry Norton (2)

(1) Institute AIFB, Karlsruhe Institute of Technology, Germany
(2) Ontotext AD, Bulgaria

Page 2: Crowdsourcing-enabled Linked Data management architecture

07.06.2012

Background: What is Linked Data?

CrowdSearch 2012 - A semantically enabled architecture for crowdsourced Linked Data management

Linked Data: set of best practices to publish and connect structured data on the Web.

URIs to identify entities and concepts in the world

HTTP to access and retrieve resources and descriptions of these resources

RDF as a generic graph-based data model to structure and link data

Taken together, Linked Data is said to form a ‘cloud’ of shared references and vocabularies.

Query language: SPARQL.

http://linkeddata.org/faq

Page 3: Crowdsourcing-enabled Linked Data management architecture


Background: Why Linked Data?

Google, Yahoo, Bing & schema.org: enhanced search

Data.gov & public sector information: more transparency and accountability in governance

BBC & media: added value of content through interlinking

Page 4: Crowdsourcing-enabled Linked Data management architecture


Outline

1. Motivation
2. Our Approach
3. Extensions to VoID and SPARQL
4. Crowdsourced query processing tasks
5. Advantages
6. Challenges

Page 5: Crowdsourcing-enabled Linked Data management architecture


1. Motivation

"Retrieve the German labels of commercial airports located in Baden-Württemberg, ordered by which comment gives the better human-readable description of the airport."

This query cannot be optimally answered automatically:

Incorrect/missing classification of entities (e.g. classification as airports instead of commercial airports).

Missing information in data sets (e.g. German labels).

Subjective operations (e.g. comparing pictures or natural-language comments) cannot be performed optimally by a machine.

User Query: Give me the German names of all commercial airports in Baden-Württemberg, ordered by their most informative description.

Page 6: Crowdsourcing-enabled Linked Data management architecture


1. Motivation

In order to answer the query as intended:

Classification of airports as commercial airports.

Identity resolution of places (Baden-Württemberg).

Translation of the labels of the airports.

Ordering of the comments by a subjective comparison.

Page 7: Crowdsourcing-enabled Linked Data management architecture

1. Motivation

SPARQL Query:

SELECT ?label WHERE {
  ?x a metar:CommercialHubAirport ;
     rdfs:label ?label ;
     rdfs:comment ?comment .
  ?x geonames:parentFeature ?z .
  ?z owl:sameAs <http://dbpedia.org/resource/Baden-Wuerttemberg> .
  FILTER (LANG(?label) = "de")
}
ORDER BY CROWD(?comment, "Better description of %x")

Query annotations: (1) Classification, (2) Identity Resolution, (3) Missing Information, (4) Ordering.

Page 8: Crowdsourcing-enabled Linked Data management architecture


1. Motivation: Our Aim

A SPARQL query engine that processes queries by seamlessly combining automatic query processing and crowdsourcing.

[Architecture diagram: the SPARQL query engine (query parsing, query optimization, query execution) alongside crowdsourced query processing (task design, UI generation), a mediator over four wrappers, and the query results.]

Page 9: Crowdsourcing-enabled Linked Data management architecture


2. Our Approach

Parser

Decomposes the input query.

Selects the data sets that should be accessed to produce answers.

Rewrites the query into the internal structures.
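The parser steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TriplePattern` structure, the `PREFIX_TO_DATASET` table and both function names are hypothetical and do not come from the slides, and data sets are selected here purely by namespace prefix.

```python
from typing import NamedTuple


class TriplePattern(NamedTuple):
    """Hypothetical internal structure for one triple pattern of the query."""
    subject: str
    predicate: str
    obj: str


# Assumed mapping from namespace prefix to the data set that uses it.
PREFIX_TO_DATASET = {
    "metar": "METAR",
    "geonames": "Geonames",
}


def decompose(where_clause):
    """Rewrite raw (s, p, o) tuples into the internal pattern structure."""
    return [TriplePattern(*triple) for triple in where_clause]


def select_datasets(patterns):
    """Collect every data set whose prefix occurs in a predicate or object."""
    selected = set()
    for pattern in patterns:
        for term in (pattern.predicate, pattern.obj):
            prefix = term.split(":", 1)[0]
            if prefix in PREFIX_TO_DATASET:
                selected.add(PREFIX_TO_DATASET[prefix])
    return selected


patterns = decompose([
    ("?x", "a", "metar:CommercialHubAirport"),
    ("?x", "rdfs:label", "?label"),
    ("?x", "geonames:parentFeature", "?z"),
])
datasets = select_datasets(patterns)
print(sorted(datasets))
```

A real parser would work on the SPARQL algebra and on VoID descriptions rather than on prefix strings; the sketch only shows the decompose, select, rewrite sequence.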

Page 10: Crowdsourcing-enabled Linked Data management architecture

2. Our Approach

Optimizer

Uses DB statistics and crowdsourcing statistics: estimated time to completion and other information about the performance (quality, cost) of the crowd.

Traditional database optimization techniques are implemented.

Determines which parts of the query should be solved by human input, based on the VoID and SPARQL extensions.

Generates logical and physical plans.
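One way to picture the optimizer's decision about human input is a simple partition of the triple patterns. The sketch below is a simplification under assumptions: the sets `CROWD_CLASSES` and `CROWD_PROPERTIES` stand in for what the VoID extensions would declare, treating `rdfs:label` as a crowd property is an illustrative choice, and the statistics-based cost model mentioned above is not modelled.

```python
# Assumed annotations; in the architecture these would be read from the
# VoID description of the data sets (CrowdClass / CrowdProperty).
CROWD_CLASSES = {"metar:CommercialHubAirport"}
CROWD_PROPERTIES = {"rdfs:label"}


def needs_crowd(pattern):
    """Return True if a (subject, predicate, object) pattern touches a crowd term."""
    _subject, predicate, obj = pattern
    if predicate == "a":
        return obj in CROWD_CLASSES
    return predicate in CROWD_PROPERTIES


def split_plan(patterns):
    """Partition patterns into automatically evaluated and crowdsourced parts."""
    automatic = [p for p in patterns if not needs_crowd(p)]
    crowd = [p for p in patterns if needs_crowd(p)]
    return automatic, crowd


query_patterns = [
    ("?x", "a", "metar:CommercialHubAirport"),
    ("?x", "rdfs:label", "?label"),
    ("?x", "geonames:parentFeature", "?z"),
]
automatic, crowd = split_plan(query_patterns)
print("automatic:", automatic)
print("crowd:", crowd)
```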

Page 11: Crowdsourcing-enabled Linked Data management architecture

2. Our Approach

Executor

Implements physical operators.

Invokes crowdsourcing component:

Creates tasks.

Generates UI.

Infers facts automatically.

Executes query against Linked Data: computational tasks.

Incorporates results from the human input.

Page 12: Crowdsourcing-enabled Linked Data management architecture

3. Extensions to VoID and SPARQL

VoID (Vocabulary of Interlinked Datasets) is the RDF-based schema for describing data sets.

Common VoID terms: void:Dataset, void:inDataset, void:Linkset, void:linkPredicate, void:target.

VoID extensions:

Automatic interlinking of datasets

CrowdClass

CrowdProperty

Page 13: Crowdsourcing-enabled Linked Data management architecture

3. Extensions to VoID and SPARQL

Automatic interlinking of data sets

Example - Specification of Data Sets:

:METAR rdf:type void:Dataset .
:Geonames rdf:type void:Dataset .
:METAR2Geonames rdf:type void:Linkset ;
    void:linkPredicate owl:sameAs ;
    void:target :METAR ;
    void:target :Geonames .

[Diagram: METAR and Geonames connected by owl:sameAs]

Page 14: Crowdsourcing-enabled Linked Data management architecture

3. Extensions to VoID and SPARQL

CrowdClass

- Specifies which entities of a data set can be crowdsourced.
- All subclasses of a crowdClass are also (implicitly) defined as crowdsourced entities.

Example:

metar:Airport void:inDataset :METAR .

metar:CommercialHubAirport void:inDataset :METAR ;
    rdfs:subClassOf metar:Airport .

metar:Airport rdf:type void:crowdClass .

metar:CommercialHubAirport rdf:type void:crowdClass.
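The implicit rule for subclasses can be illustrated with a small fixpoint computation. This is a sketch, not the authors' implementation: the function name is hypothetical, and `ex:RegionalHub` is an invented class added only to show that the rule applies transitively.

```python
def implicit_crowd_classes(declared, subclass_of):
    """Expand declared crowd classes with all of their (transitive) subclasses.

    declared:    set of classes explicitly typed as void:crowdClass
    subclass_of: set of (subclass, superclass) pairs
    """
    result = set(declared)
    changed = True
    while changed:
        changed = False
        for sub, sup in subclass_of:
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result


subclass_of = {
    ("metar:CommercialHubAirport", "metar:Airport"),
    ("ex:RegionalHub", "metar:CommercialHubAirport"),  # hypothetical class
}
crowd_classes = implicit_crowd_classes({"metar:Airport"}, subclass_of)
print(sorted(crowd_classes))
```

Under this reading, declaring only `metar:Airport` as a crowd class would already cover `metar:CommercialHubAirport`.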

Page 15: Crowdsourcing-enabled Linked Data management architecture

3. Extensions to VoID and SPARQL

RDF data can be queried with SPARQL.

Common SPARQL operators: join, union, optional, filter, order by.

Properties related to general ontology languages such as OWL are treated as extensions of SPARQL operators, and are modeled in our architecture as tasks.

Page 16: Crowdsourcing-enabled Linked Data management architecture

4. Tasks

Formal, declarative description of the data and tasks using SPARQL patterns as a basis for the automatic design of HITs (Human Intelligence Tasks).

Identity resolution

Missing information

Ontological classification

Ordering (new operator)

Page 17: Crowdsourcing-enabled Linked Data management architecture

4.1. Ontological Classification

It is not always possible to automatically infer classification from the properties.

Example: Retrieve the names (labels) of METAR stations that correspond to commercial airports.

SELECT ?label WHERE {
  ?station a metar:CommercialHubAirport ;
           rdfs:label ?label .
}

Input:
{ ?station a metar:Station ; rdfs:label ?label ; wgs84:lat ?lat ; wgs84:long ?long }

Output:
{ ?station a ?type . ?type rdfs:subClassOf metar:Station }
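To make the input and output patterns concrete, the following sketch turns one binding of the input pattern into a task description and a worker's answer into an output triple. Everything here is assumed for illustration: the dictionary layout of the HIT, the function names, the station identifier, its coordinates and the second candidate type are hypothetical.

```python
def build_classification_hit(binding, candidate_types):
    """Build a task description from one solution of the input pattern."""
    return {
        "question": f"Which type best describes '{binding['label']}'?",
        "context": {"lat": binding["lat"], "long": binding["long"]},
        "options": list(candidate_types),
    }


def answer_to_triple(binding, chosen_type):
    """Map a worker's choice to a triple matching the output pattern."""
    return (binding["station"], "rdf:type", chosen_type)


# Hypothetical binding for the input pattern.
binding = {
    "station": "metar:station1",
    "label": "Stuttgart Airport",
    "lat": 48.69,
    "long": 9.22,
}
hit = build_classification_hit(
    binding, ["metar:CommercialHubAirport", "metar:OtherStation"]
)
triple = answer_to_triple(binding, "metar:CommercialHubAirport")
print(hit["question"])
print(triple)
```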

Page 18: Crowdsourcing-enabled Linked Data management architecture


4.2. Ordering

Orderings defined via less straightforward built-ins; for instance, the ordering of pictorial representations of entities.

SPARQL extension: ORDER BY CROWD

Example: Retrieve all airports and their pictures, with the pictures ordered by how representative each image is of the given airport.

SELECT ?airport ?picture WHERE {
  ?airport a metar:Airport ;
           foaf:depiction ?picture .
}
ORDER BY CROWD(?picture, "Most representative image for %airport")

Input:
{ ?airport foaf:depiction ?x, ?y }

Output:
{ { (?x ?y) a rdf:List } UNION { (?y ?x) a rdf:List } }
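The ordering task yields pairwise answers, which still have to be combined into one ranking. The slides do not say how this is done; the sketch below assumes a simple scheme (majority vote per pair, then sorting by number of pairs won), and the function name and picture identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def rank_by_pairwise_votes(items, votes):
    """Rank items from crowd answers.

    votes: list of (preferred, other) pairs, one per worker answer.
    An item wins a pair if a strict majority of answers prefers it;
    items are then sorted by the number of pairs they won.
    """
    tally = Counter(votes)
    wins = {item: 0 for item in items}
    for x, y in combinations(items, 2):
        if tally[(x, y)] > tally[(y, x)]:
            wins[x] += 1
        elif tally[(y, x)] > tally[(x, y)]:
            wins[y] += 1
    # sorted() is stable, so tied items keep their input order.
    return sorted(items, key=lambda item: -wins[item])


pictures = ["pic_a", "pic_b", "pic_c"]
votes = [
    ("pic_b", "pic_a"), ("pic_b", "pic_a"), ("pic_a", "pic_b"),
    ("pic_b", "pic_c"),
    ("pic_c", "pic_a"), ("pic_c", "pic_a"),
]
ranking = rank_by_pairwise_votes(pictures, votes)
print(ranking)
```

Other aggregation schemes are possible; this one is chosen only because it is short and easy to check by hand.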

Page 19: Crowdsourcing-enabled Linked Data management architecture


4.3. Computational tasks expressed as SPARQL queries

Transitive relations inferred automatically, without requiring human intervention.

Implementation of restrictions in SPIN.

Identity Resolution:

CONSTRUCT { ?a owl:sameAs ?c . }
WHERE { ?a owl:sameAs ?b . ?b owl:sameAs ?c . }

Classification:

CONSTRUCT { ?a a ?b . ?b rdfs:subClassOf ?c . }
WHERE { ?a rdfs:subClassOf ?c . ?b rdfs:subClassOf ?b1 . ?b1 rdfs:subClassOf ?c . }

Ordering:

CONSTRUCT { (?a ?b) a rdf:List . }
WHERE { (?a ?x) a rdf:List . (?x ?b) a rdf:List . }
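The identity-resolution rule is a transitive closure, which can be mimicked outside a SPARQL engine by applying the rule until no new pairs appear. The Python sketch below does this on plain tuples; the function name and the three identifiers are hypothetical, and only transitivity is applied, as in the CONSTRUCT query (no symmetry).

```python
def same_as_closure(links):
    """Apply 'a sameAs b, b sameAs c => a sameAs c' until a fixpoint is reached."""
    closure = set(links)
    while True:
        inferred = {
            (a, c)
            for (a, b) in closure
            for (b2, c) in closure
            if b == b2
        }
        if inferred <= closure:  # nothing new was derived
            return closure
        closure |= inferred


# Hypothetical identifiers for one airport in three data sets.
links = {
    ("metar:station1", "geonames:feature1"),
    ("geonames:feature1", "dbpedia:Airport1"),
}
closure = same_as_closure(links)
print(sorted(closure))
```

Only the two original links would have to come from the crowd or from existing data; the third pair is inferred without human input.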

Page 20: Crowdsourcing-enabled Linked Data management architecture


5. Advantages

A declarative description of the data makes it possible to decompose the query.

UIs are generated automatically.

Human tasks are generated on the fly, and the design of each task can be adjusted.

Results are checked for consistency automatically by reasoning against a validating ontology.

Page 21: Crowdsourcing-enabled Linked Data management architecture

6. Challenges

Appropriate level of granularity in HIT design for specific SPARQL constructs.

Caching: naively, we can materialise HIT results into datasets.

How to deal with partial coverage and dynamic datasets.

Optimal user interfaces for graph-like content.

Pricing and worker assignment.
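The naive caching idea from the list above can be sketched as a lookup table in front of the crowd. This is an assumption-laden illustration: `HitCache`, `fake_crowd` and the task key format are hypothetical, the crowd is replaced by a stub, and nothing is ever invalidated, which is exactly the open issue for dynamic datasets.

```python
class HitCache:
    """Materialise crowd answers so that a repeated task is not posted again."""

    def __init__(self, ask_crowd):
        self._ask_crowd = ask_crowd  # callable that actually posts the task
        self._answers = {}           # task key -> stored answer

    def resolve(self, task_key):
        if task_key not in self._answers:
            self._answers[task_key] = self._ask_crowd(task_key)
        return self._answers[task_key]


calls = []


def fake_crowd(task_key):
    """Stub standing in for a real crowdsourcing platform."""
    calls.append(task_key)
    return "metar:CommercialHubAirport"


cache = HitCache(fake_crowd)
first = cache.resolve(("classify", "metar:station1"))
second = cache.resolve(("classify", "metar:station1"))
print(first, len(calls))
```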

Page 22: Crowdsourcing-enabled Linked Data management architecture

QUESTIONS
