Crowdsourcing-enabled Linked Data management architecture

KIT – University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu

Institute of Applied Informatics and Formal Description Methods (AIFB)


A semantically enabled architecture for crowdsourced Linked Data management

Elena Simperl (1), Maribel Acosta (1), Barry Norton (2)

(1) Institute AIFB, Karlsruhe Institute of Technology, Germany

(2) Ontotext AD, Bulgaria

http://www.aifb.kit.edu/web/Hauptseite/en

http://www.ontotext.com/


2 07.06.2012

Background: What is Linked Data?

CrowdSearch 2012 - A semantically enabled architecture for crowdsourced Linked Data management

Linked Data: set of best practices to publish and connect structured data on the Web.

URIs to identify entities and concepts in the world

HTTP to access and retrieve resources and descriptions of these resources

RDF as generic graph-based data model to structure and link data

Taken together, Linked Data is said to form a ‘cloud’ of shared references and vocabularies.

Query language: SPARQL.

http://linkeddata.org/faq





Background: Why Linked Data?


Google, Yahoo, Bing & schema.org: enhanced search

Data.gov & public sector information: more transparency and accountability in governance

BBC & media: added value of content through interlinking




Outline


1. Motivation
2. Our Approach
3. Extensions to VoID and SPARQL
4. Crowdsourced query processing tasks
5. Advantages
6. Challenges




1. Motivation

“Retrieve the German labels of commercial airports located in Baden-Württemberg, ordered by the most human-readable description of the airport given in the comment.”

This query cannot be optimally answered automatically:

Incorrect/missing classification of entities (e.g. classification as airports instead of commercial airports).

Missing information in data sets (e.g. German labels).

It is not possible to optimally perform subjective operations (e.g. comparisons of pictures or NL comments).


User Query: Give me the German names of all commercial airports in Baden-Württemberg, ordered by their most informative description.




1. Motivation


In order to answer the query as intended:

Classification of airports as commercial airports.

Identity resolution of places (Baden-Württemberg).

Translation of the labels of the airports.

Ordering of the comments by a subjective comparison.





1. Motivation


SPARQL Query:

SELECT ?label WHERE {
  ?x a metar:CommercialHubAirport ;
     rdfs:label ?label ;
     rdfs:comment ?comment .
  ?x geonames:parentFeature ?z .
  ?z owl:sameAs <http://dbpedia.org/resource/Baden-Wuerttemberg> .
  FILTER (LANG(?label) = "de")
}
ORDER BY CROWD(?comment, "Better description of %x")


Query annotations: 1 Classification; 2 Identity Resolution; 3 Missing Information; 4 Ordering




1. Motivation: Our Aim

A SPARQL query engine able to process queries using a seamless combination of automatic query processing and crowdsourcing.


[Figure: architecture overview. Components shown: SPARQL query engine (query parsing, query optimization, query execution, query results); crowdsourced query processing (task design, UI generation); mediator; four wrappers.]




2. Our Approach

Parser

Decomposes the input query.

Selects the data sets that should be accessed to produce answers.

Rewrites the query into the internal structures.






2. Our Approach

Optimizer

DB statistics and crowdsourcing statistics: estimated time to completion, and other information about the performance (quality, cost) of the crowd.

Traditional database optimization techniques are implemented.

Determines which parts of the query should be solved by human input: VoID and SPARQL extensions.

Generates logical and physical plans.
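The step of deciding which parts of a query go to the crowd could be sketched as follows. This is only an illustration, not the engine's actual optimizer: the `TriplePattern` type, the `CROWD_CLASSES` and `CROWD_PROPERTIES` sets, and the `partition` function are hypothetical names, and the sets are assumed to have been read from the VoID description beforehand.

```python
from typing import List, NamedTuple, Tuple


class TriplePattern(NamedTuple):
    """One triple pattern of a parsed query (hypothetical internal structure)."""
    subject: str
    predicate: str
    obj: str


# Assumption: these sets were extracted from the VoID description of the data sets.
CROWD_CLASSES = {"metar:Airport", "metar:CommercialHubAirport"}
CROWD_PROPERTIES = {"rdfs:label"}  # illustrative choice, not taken from the slides


def needs_crowd(pattern: TriplePattern) -> bool:
    """A pattern is crowdsourced if it types against a crowd class
    or uses a property marked as crowdsourceable."""
    if pattern.predicate in ("a", "rdf:type"):
        return pattern.obj in CROWD_CLASSES
    return pattern.predicate in CROWD_PROPERTIES


def partition(patterns: List[TriplePattern]) -> Tuple[List[TriplePattern], List[TriplePattern]]:
    """Split patterns into (crowd part, machine part), preserving order."""
    crowd = [p for p in patterns if needs_crowd(p)]
    machine = [p for p in patterns if not needs_crowd(p)]
    return crowd, machine
```

A real optimizer would additionally weigh the crowd statistics mentioned above (time, quality, cost) before committing a pattern to human input.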






2. Our Approach

Executor

Implements physical operators.

Invokes crowdsourcing component:

Creates tasks.

Generates UI.

Infers facts automatically.

Executes query against Linked Data: computational tasks.

Incorporates results from the human input.
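One possible shape for combining computational and human steps in the executor is sketched below. The order of the steps (automatic evaluation first, crowd confirmation second) and all names (`execute`, `machine_step`, `crowd_step`, the `ex:` identifiers) are assumptions made for illustration. The crowd is replaced by a stub function.

```python
from typing import Callable, Dict, Iterable, List

Binding = Dict[str, str]


def execute(machine_step: Callable[[], Iterable[Binding]],
            crowd_step: Callable[[Binding], bool]) -> List[Binding]:
    """Evaluate the automatic part, then keep only bindings the crowd confirms."""
    results = []
    for binding in machine_step():
        if crowd_step(binding):
            results.append(binding)
    return results


# Stub for the part answered against Linked Data (hypothetical bindings).
def machine_step() -> List[Binding]:
    return [{"?x": "ex:airportA"}, {"?x": "ex:airportB"}]


# Stub for the human input; a real system would publish a task and wait for answers.
def crowd_step(binding: Binding) -> bool:
    return binding["?x"] == "ex:airportA"


results = execute(machine_step, crowd_step)
```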






3. Extensions to VoID and SPARQL

The RDF-based schema to describe data sets is VoID (Vocabulary of Interlinked Datasets).

Common VoID terms: void:Dataset, void:inDataset, void:Linkset, void:linkPredicate, void:target.

VoID extensions:

Automatic interlinking of datasets

CrowdClass

CrowdProperty






Automatic interlinking of data sets

Example - Specification of Data Sets:

:METAR rdf:type void:Dataset .
:Geonames rdf:type void:Dataset .
:METAR2Geonames rdf:type void:Linkset ;
    void:linkPredicate owl:sameAs ;
    void:target :METAR ;
    void:target :Geonames .

[Figure: the Geonames and METAR data sets connected by owl:sameAs.]






CrowdClass

- Specifies which entities of a data set could be crowdsourced.
- All subclasses of a crowdClass are also (implicitly) defined as crowdsourced entities.

Example:

metar:Airport void:inDataset :METAR .

metar:CommercialHubAirport void:inDataset :METAR ;
    rdfs:subClassOf metar:Airport .

metar:Airport rdf:type void:crowdClass .

metar:CommercialHubAirport rdf:type void:crowdClass .
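The rule that subclasses of a crowdClass are implicitly crowdsourced can be illustrated with a small closure computation. This is a sketch under assumptions: the function name `implicit_crowd_classes` and the representation of subclass statements as `(child, parent)` pairs are hypothetical and not part of VoID or the presented system.

```python
from typing import Dict, Iterable, Set, Tuple


def implicit_crowd_classes(declared: Iterable[str],
                           subclass_of: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return the declared crowd classes plus all their transitive subclasses."""
    # Index the (child, parent) pairs as parent -> set of direct children.
    children: Dict[str, Set[str]] = {}
    for child, parent in subclass_of:
        children.setdefault(parent, set()).add(child)

    result = set(declared)
    stack = list(result)
    while stack:
        cls = stack.pop()
        for child in children.get(cls, ()):
            if child not in result:  # also guards against cycles
                result.add(child)
                stack.append(child)
    return result


crowd = implicit_crowd_classes(
    declared={"metar:Airport"},
    subclass_of=[("metar:CommercialHubAirport", "metar:Airport")],
)
```

Under this reading, declaring only metar:Airport as a crowd class would already cover metar:CommercialHubAirport.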






RDF data can be queried using the language SPARQL.

Common SPARQL operators: join, union, optional, filter, order by.

Properties related to general ontology languages such as OWL are treated as extensions of SPARQL operators, and are modeled in our architecture as tasks.





4. Tasks

Formal, declarative description of the data and tasks using SPARQL patterns as a basis for the automatic design of HITs (Human Intelligence Tasks).

Identity resolution

Missing information

Ontological classification

Ordering (new operator)





4.1. Ontological Classification

It is not always possible to automatically infer classification from the properties.

Example: Retrieve the names (labels) of METAR stations that correspond to commercial airports.

SELECT ?label WHERE {
  ?station a metar:CommercialHubAirport ;
           rdfs:label ?label .
}


Input: {?station a metar:Station; rdfs:label ?label; wgs84:lat ?lat; wgs84:long ?long}

Output: {?station a ?type. ?type rdfs:subClassOf metar:Station}
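How a classification HIT might be generated from one solution of the input pattern, and how a worker's answer maps back to the output pattern, is sketched below. The plain-text layout, the function names, and the example values (`ex:station1`, the label, the coordinates, `metar:OtherStation`) are all hypothetical placeholders; the slides do not specify the task layout.

```python
from typing import Dict, List, Tuple


def build_classification_hit(binding: Dict[str, str], candidate_types: List[str]) -> str:
    """Render one input binding as a plain-text multiple-choice task."""
    lines = [
        f"Station: {binding['label']}",
        f"Location: {binding['lat']}, {binding['long']}",
        "Which type best describes this station?",
    ]
    lines += [f"  [{i}] {t}" for i, t in enumerate(candidate_types, start=1)]
    return "\n".join(lines)


def answer_to_triple(binding: Dict[str, str], candidate_types: List[str],
                     choice: int) -> Tuple[str, str, str]:
    """Map a worker's 1-based choice to an output triple (?station a ?type)."""
    return (binding["station"], "rdf:type", candidate_types[choice - 1])


# Placeholder binding for the input pattern; values are made up for illustration.
binding = {"station": "ex:station1", "label": "Example Station", "lat": "0.0", "long": "0.0"}
types = ["metar:CommercialHubAirport", "metar:OtherStation"]

hit_text = build_classification_hit(binding, types)
triple = answer_to_triple(binding, types, choice=1)
```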




4.2. Ordering

Orderings defined via less straightforward built-ins; for instance, the ordering of pictorial representations of entities.

SPARQL extension: ORDER BY CROWD

Example: Retrieve all airports and their pictures, with the pictures ordered by how representative each image is of the given airport.

SELECT ?airport ?picture WHERE {
  ?airport a metar:Airport ;
           foaf:depiction ?picture .
}
ORDER BY CROWD(?picture, "Most representative image for %airport")


Input: {?airport foaf:depiction ?x, ?y}

Output: {{(?x ?y) a rdf:List} UNION {(?y ?x) a rdf:List}}
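The output pattern yields pairwise judgements, which still have to be combined into a total order. One simple way to do this is to count wins per item, sketched below. This aggregation is a swapped-in illustration, not the method described in the slides. It also assumes that a pair `(x, y)` means the worker preferred `x` over `y`. The function name and the picture identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_by_wins(items: List[str], preferences: Iterable[Tuple[str, str]]) -> List[str]:
    """Order items by the number of pairwise comparisons they won (most wins first)."""
    wins = Counter({item: 0 for item in items})
    for preferred, _other in preferences:
        wins[preferred] += 1
    # sorted() is stable, so items with equal wins keep their input order.
    return sorted(items, key=lambda item: -wins[item])


pictures = ["ex:pic1", "ex:pic2", "ex:pic3"]
crowd_pairs = [("ex:pic2", "ex:pic1"), ("ex:pic2", "ex:pic3"), ("ex:pic3", "ex:pic1")]
ranking = rank_by_wins(pictures, crowd_pairs)
```

Win counting ignores contradictory or missing judgements; handling those is a separate design question.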




4.3. Computational tasks expressed as SPARQL queries

Transitive relations inferred automatically, without requiring human intervention.

Implementation of restrictions in SPIN.


Identity Resolution:

CONSTRUCT { ?a owl:sameAs ?c . }
WHERE { ?a owl:sameAs ?b . ?b owl:sameAs ?c . }

Classification:

CONSTRUCT { ?a a ?b . ?b rdfs:subClassOf ?c . }
WHERE { ?a rdfs:subClassOf ?c . ?b rdfs:subClassOf ?b1 . ?b1 rdfs:subClassOf ?c . }

Ordering:

CONSTRUCT { (?a ?b) a rdf:List . }
WHERE { (?a ?x) a rdf:List . (?x ?b) a rdf:List . }
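The effect of applying the identity-resolution rule repeatedly can be mimicked outside a SPARQL engine. The sketch below applies the same join (a, b) and (b, c) gives (a, c) until no new pairs appear. The function name and the use of plain string pairs are assumptions for illustration. Like the rule itself, it adds transitivity only, not symmetry.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def transitive_closure(pairs: Iterable[Pair]) -> Set[Pair]:
    """Repeatedly derive (a, c) from (a, b) and (b, c) until a fixpoint is reached."""
    closure = set(pairs)
    while True:
        derived = {(a, c) for (a, b) in closure for (b2, c) in closure if b == b2}
        if derived <= closure:  # nothing new: fixpoint
            return closure
        closure |= derived


same_as = transitive_closure({("ex:a", "ex:b"), ("ex:b", "ex:c"), ("ex:c", "ex:d")})
```

The same loop would also cover the ordering rule, with pairs read as "comes before" instead of owl:sameAs.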




5. Advantages

Declarative description of data allows the query to be decomposed.

Automatic generation of the UIs.

Generation of human tasks on-the-fly and adjustment of the design of the task.

Automatic consistency check of results by reasoning against a validating ontology.





6. Challenges

Appropriate level of granularity for HIT design for specific SPARQL constructs.

Caching: naively, we can materialise HIT results into datasets.

How to deal with partial coverage and dynamic datasets.

Optimal user interfaces for graph-like content.

Pricing and workers’ assignment.
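The naive caching idea from the list above can be pictured as a lookup table in front of the crowd. This is a minimal sketch under assumptions: `HitCache`, `resolve`, and the task key are hypothetical names, and the crowd is a stub function. It deliberately has no invalidation, which is exactly where dynamic datasets become a problem.

```python
from typing import Callable, Dict


class HitCache:
    """Materialise crowd answers so that a repeated task is not posted twice."""

    def __init__(self, ask_crowd: Callable[[str], str]) -> None:
        self._ask_crowd = ask_crowd
        self._answers: Dict[str, str] = {}
        self.crowd_calls = 0  # how many tasks actually reached the crowd

    def resolve(self, task_key: str) -> str:
        if task_key not in self._answers:
            self.crowd_calls += 1
            self._answers[task_key] = self._ask_crowd(task_key)
        return self._answers[task_key]


# Stub crowd: always returns the same answer (illustration only).
cache = HitCache(lambda task_key: "yes")
first = cache.resolve("classify ex:station1")
second = cache.resolve("classify ex:station1")
```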





QUESTIONS


