Microtask Crowdsourcing Applications for Linked Data


Page 1: Microtask Crowdsourcing Applications for Linked Data

Microtask Crowdsourcing

Applications for Linked Data

Page 2: Microtask Crowdsourcing Applications for Linked Data


Architecture of Linked Data Applications

[Figure: architecture diagram with three tiers: Presentation Tier, Logic Tier, and Data Tier. Labelled elements: Data Access Component, Data Integration Component, Vocabulary Mapping, Interlinking, Cleansing, Integrated Dataset, Republication Component, SPARQL Wr., R2R Transf., LD Wrapper, Physical Wrapper. Labelled data sources: SPARQL Endpoints, Linked Data, RDF/XML, Relational Data, Web Data accessed via APIs.]

EUCLID – Microtask crowdsourcing applications for Linked Data

Page 3: Microtask Crowdsourcing Applications for Linked Data


CH 2

Data Integration Component

• Consolidates the data retrieved from heterogeneous sources.

• This component may operate at:

  – Schema level: Performs vocabulary mappings in order to translate data into a single unified schema. Links correspond to RDFS properties or OWL property and class axioms.

  – Instance level: Performs entity linking, e.g., entity resolution via owl:sameAs links. (CH 3)
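Instance-level integration via owl:sameAs can be illustrated with a minimal sketch. It assumes the links are already available as plain pairs of URIs and groups them into equivalence classes with a union-find structure. The function name `same_as_clusters` and the `ex:instagram` URI are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def same_as_clusters(links):
    """Group URIs connected by owl:sameAs links into equivalence classes."""
    parent = {}

    def find(uri):
        # Register unseen URIs as their own root, then walk up to the root.
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two classes

    clusters = defaultdict(set)
    for uri in list(parent):
        clusters[find(uri)].add(uri)
    # Sort for a deterministic result.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Illustrative owl:sameAs links; "ex:instagram" is a made-up third source.
links = [
    ("dbpedia:Instagram", "fbase:Instagram"),
    ("fbase:Instagram", "ex:instagram"),
    ("dbpedia:Facebook", "fbase:Facebook"),
]
clusters = same_as_clusters(links)
for cluster in clusters:
    print(cluster)
```

Each resulting cluster stands for one real-world entity, so an integrated dataset could pick one URI per cluster as the canonical identifier.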

[Figure: Data Tier with the Data Access Component and the Data Integration Component (Vocabulary Mapping, Interlinking, Cleansing).]

Page 4: Microtask Crowdsourcing Applications for Linked Data


Data Integration Component

The data integration component can be enhanced by including microtask crowdsourcing approaches:

• Cleansing or data assessments: Assessment of DBpedia triples

• Vocabulary mapping: CrowdMAP

• Interlinking: ZenCrowd

[Figure: Data Tier (2), with the Data Access Component and the Data Integration Component (Vocabulary Mapping, Interlinking, Cleansing).]

Page 5: Microtask Crowdsourcing Applications for Linked Data


Other Crowdsourcing-based Solutions for Linked Data Tasks

• Query understanding: CrowdQ

• Ontology population: OntoGame

• Linked Data curation: Urbanopoly

• …


Page 6: Microtask Crowdsourcing Applications for Linked Data

DBPEDIA QUALITY ASSESSMENT


Page 7: Microtask Crowdsourcing Applications for Linked Data

Assessing DBpedia Triples

1. Selecting LD quality issues generated by erroneous extraction mechanisms and that can be detected by the crowd

2. Selecting the appropriate crowdsourcing approaches

3. Designing and generating the interfaces to present the data to the crowd

[Figure: triples {s p o .} from the dataset are presented to the crowd, which marks each triple as Correct or as Incorrect + Quality issue.]

Page 8: Microtask Crowdsourcing Applications for Linked Data

Selecting LD Quality Issues to Crowdsource

Three categories of quality problems occur pervasively in DBpedia [Zaveri2013] and can be crowdsourced:

• Incorrect object
  Example: dbpedia:Dave_Dobbyn dbprop:dateOfBirth “3”.

• Incorrect data type
  Example: dbpedia:Torishima_Izu_Islands foaf:name “鳥島”@en.

• Incorrect link to “external Web pages”
  Example: dbpedia:John-Two-Hawks dbpedia-owl:wikiPageExternalLink <http://cedarlakedvd.com/>

Page 9: Microtask Crowdsourcing Applications for Linked Data

Selecting Appropriate Crowdsourcing Approaches

Find stage: Contest; LD Experts; Difficult task; Final prize; TripleCheckMate [Kontokostas2013]

Verify stage: Microtasks; Workers; Easy task; Micropayments; MTurk

Adapted from [Bernstein2010]


Page 10: Microtask Crowdsourcing Applications for Linked Data

Presenting the Data to the Crowd

• Selection of foaf:name or rdfs:label to extract human-readable descriptions

• Real object values extracted automatically from Wikipedia infoboxes

• Link to the Wikipedia article via foaf:isPrimaryTopicOf

• Preview of external pages using an HTML iframe

Microtask interfaces (MTurk tasks): Incorrect object, Incorrect data type, Incorrect outlink
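How a triple could be turned into the content of a microtask is sketched below. This is only an illustration under stated assumptions: the function `build_microtask`, the label dictionary, and the Wikipedia URL are hypothetical stand-ins for values that would come from foaf:name or rdfs:label and from foaf:isPrimaryTopicOf.

```python
def build_microtask(triple, labels, wikipedia_pages):
    """Turn an RDF triple into a human-readable verification task."""
    subject, predicate, obj = triple
    return {
        # Fall back to the raw identifier when no label is known.
        "subject": labels.get(subject, subject),
        "predicate": labels.get(predicate, predicate),
        "object": obj,
        # Link to the article so workers can compare against the infobox.
        "wikipedia": wikipedia_pages.get(subject),
        "options": ["Correct", "Incorrect"],
    }


# Hypothetical labels and article link for the "incorrect object" example.
labels = {
    "dbpedia:Dave_Dobbyn": "Dave Dobbyn",
    "dbprop:dateOfBirth": "date of birth",
}
wikipedia_pages = {"dbpedia:Dave_Dobbyn": "http://en.wikipedia.org/wiki/Dave_Dobbyn"}

task = build_microtask(
    ("dbpedia:Dave_Dobbyn", "dbprop:dateOfBirth", "3"), labels, wikipedia_pages
)
print(task["subject"], "|", task["predicate"], "|", task["object"])
```

Keeping the task as plain data leaves the rendering (HTML form, iframe preview) to whatever template the crowdsourcing platform expects.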


Page 11: Microtask Crowdsourcing Applications for Linked Data


Results

                          Object values   Data types   Interlinks
Linked Data experts       0.7151          0.8270       0.1525
MTurk (majority voting)   0.8977          0.4752       0.9412

• Both forms of crowdsourcing can be applied to detect certain LD quality issues

• The effort of LD experts should be spent on tasks that demand domain-specific skills

• The MTurk crowd is exceptionally good at comparing data entries
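The majority voting used to aggregate the MTurk answers can be sketched as follows. This is a minimal illustration, not the evaluation code behind the table. The function name and the choice to return `None` on a tie are assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, or None for an empty list or a tie."""
    if not answers:
        return None
    ranked = Counter(answers).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no strict majority among the top answers
    return ranked[0][0]


# Three hypothetical worker answers for one triple.
verdict = majority_vote(["Incorrect", "Incorrect", "Correct"])
print(verdict)
```

Assigning an odd number of workers to each binary task avoids ties, as long as every worker submits an answer.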


Page 12: Microtask Crowdsourcing Applications for Linked Data

ZENCROWD


Page 13: Microtask Crowdsourcing Applications for Linked Data


ZenCrowd: Entity Linking by the Crowd

• Combine both algorithmic and manual linking

• Automate manual linking via crowdsourcing

• Dynamically assess human workers with a probabilistic reasoning framework

[Figure: the Crowd combined with Algorithms and Machines.]

Page 14: Microtask Crowdsourcing Applications for Linked Data


HTML:

<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>

RDFa enrichment:

<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>

[Figure: the entity mentions are linked to http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram; fbase:Instagram is attached via owl:sameAs. Further labels in the figure: Google, Android.]
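The RDFa enrichment step on this slide can be approximated with a small sketch. It assumes the entity links have already been decided and are passed in as a mention-to-URI dictionary. Unlike the slide's markup, which wraps whole sentences, the hypothetical function `enrich_with_rdfa` wraps only the first occurrence of each mention.

```python
import html
import re


def enrich_with_rdfa(text, entity_uris):
    """Wrap the first occurrence of each known mention in RDFa markup."""
    if not entity_uris:
        return html.escape(text)
    # Longest mentions first, so longer names win over their prefixes.
    mentions = sorted(entity_uris, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in mentions))
    seen, parts, pos = set(), [], 0
    for match in pattern.finditer(text):
        mention = match.group(0)
        parts.append(html.escape(text[pos:match.start()]))
        if mention in seen:
            parts.append(html.escape(mention))  # later occurrences stay plain
        else:
            seen.add(mention)
            parts.append(
                '<span about="{}"><cite property="rdfs:label">{}</cite></span>'.format(
                    html.escape(entity_uris[mention]), html.escape(mention)
                )
            )
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


entity_uris = {
    "Facebook": "http://dbpedia.org/resource/Facebook",
    "Instagram": "http://dbpedia.org/resource/Instagram",
}
enriched = enrich_with_rdfa("Facebook has purchased Instagram.", entity_uris)
print(enriched)
```

Scanning the original text in a single pass means the inserted markup is never searched for further mentions.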


Page 15: Microtask Crowdsourcing Applications for Linked Data


ZenCrowd Architecture

Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).


Page 16: Microtask Crowdsourcing Applications for Linked Data


Entity Factor Graphs

• Graph components

  – Workers, links, clicks

  – Prior probabilities

  – Link factors

  – Constraints

• Probabilistic inference

  – Select all links with posterior prob > τ

[Figure: entity factor graph for 2 workers, 6 clicks, and 3 candidate links, showing link priors, worker priors, observed variables, link factors, SameAs constraints, and dataset unicity constraints.]
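The idea of combining link priors, worker priors, and observed clicks into a posterior can be sketched with a deliberately simplified model. This is not ZenCrowd's factor graph: it treats workers as independent, leaves out the SameAs and dataset unicity constraints, and all names and numbers are made up for illustration.

```python
def link_posterior(prior, votes, reliability):
    """Posterior that a link is valid, given independent worker votes.

    prior: prior probability that the link is valid
    votes: worker id -> True (worker clicked "valid") or False
    reliability: worker id -> probability that the worker answers correctly
    """
    p_valid, p_invalid = prior, 1.0 - prior
    for worker, says_valid in votes.items():
        r = reliability[worker]
        p_valid *= r if says_valid else 1.0 - r
        p_invalid *= 1.0 - r if says_valid else r
    return p_valid / (p_valid + p_invalid)


# Hypothetical setting: 2 workers, 3 candidate links, 6 clicks.
reliability = {"w1": 0.9, "w2": 0.6}
priors = {"link1": 0.5, "link2": 0.5, "link3": 0.2}
clicks = {
    "link1": {"w1": True, "w2": True},
    "link2": {"w1": False, "w2": True},
    "link3": {"w1": False, "w2": False},
}

tau = 0.5  # decision threshold, chosen arbitrarily here
posteriors = {
    link: link_posterior(priors[link], clicks[link], reliability) for link in priors
}
selected = [link for link, p in posteriors.items() if p > tau]
print(selected)
```

In this toy setting the reliable worker's vote dominates, so a link that only the less reliable worker accepted stays below the threshold.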


Page 17: Microtask Crowdsourcing Applications for Linked Data


Lessons Learnt

• Crowdsourcing + probabilistic reasoning works!

• But

  – Different worker communities perform differently

  – Many low-quality workers

  – Completion time may vary (based on reward)

• Need to find the right workers for your task (see WWW13 paper)

Page 18: Microtask Crowdsourcing Applications for Linked Data


ZenCrowd Summary

• ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking

• Standard crowdsourcing improves 6% over automatic

• 4%–35% improvement over standard crowdsourcing

• 14% average improvement over automatic approaches

• Follow-up work (VLDBJ):

  – Also used for instance matching across datasets

  – 3-way blocking with the crowd

http://exascale.info/zencrowd/


Page 19: Microtask Crowdsourcing Applications for Linked Data

CROWDQ – CROWD-POWERED QUERY UNDERSTANDING


Page 20: Microtask Crowdsourcing Applications for Linked Data


Motivation

• Web Search Engines can answer simple factual queries directly on the result page

• Users with complex information needs are often unsatisfied

• Purely automatic techniques are not enough

• We want to solve it with Crowdsourcing!


Page 21: Microtask Crowdsourcing Applications for Linked Data


CrowdQ

• CrowdQ is the first system that uses crowdsourcing to

  – Understand the intended meaning

  – Build a structured query template

  – Answer the query over Linked Open Data

Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).


Page 22: Microtask Crowdsourcing Applications for Linked Data


Page 23: Microtask Crowdsourcing Applications for Linked Data


CrowdQ Architecture

• Off-line: query template generation with the help of the crowd

• On-line: query template matching using NLP and search over open data

Page 24: Microtask Crowdsourcing Applications for Linked Data


Hybrid Human-Machine Pipeline

Q = birthdate of actors of forrest gump

1. Query annotation: birthdate (Noun), actors (Noun), forrest gump (Named entity)

2. Verification: Is forrest gump this entity in the query?

3. Entity relations: Which is the relation between actors and forrest gump? Answer: starring

4. Schema element: Starring, <dbpedia-owl:starring>

5. Verification: Is the relation between Indiana Jones – Harrison Ford and Back to the Future – Michael J. Fox of the same type as Forrest Gump – actors?

Page 25: Microtask Crowdsourcing Applications for Linked Data


Structured query generation

SELECT ?y ?x
WHERE { ?y <dbpedia-owl:birthdate> ?x .
        ?z <dbpedia-owl:starring> ?y .
        ?z <rdfs:label> 'Forrest Gump' }

Results from BTC09:

[Figure: results for Q = birthdate of actors of forrest gump; labels in the figure: MOVIE, MOVIE.]
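Instantiating a structured query template like the one on this slide can be sketched in a few lines. The template text copies the slide's query as written. The helper names `escape_literal` and `instantiate` are hypothetical, and the sketch only builds the query string without sending it to any endpoint.

```python
def escape_literal(value):
    """Escape backslashes and single quotes for a single-quoted literal."""
    return value.replace("\\", "\\\\").replace("'", "\\'")


# Doubled braces are literal braces in str.format templates.
TEMPLATE = (
    "SELECT ?y ?x\n"
    "WHERE {{ ?y <{attribute}> ?x .\n"
    "        ?z <{relation}> ?y .\n"
    "        ?z <rdfs:label> '{label}' }}"
)


def instantiate(attribute, relation, label):
    """Fill the template slots: the asked attribute, the relation, the entity label."""
    return TEMPLATE.format(
        attribute=attribute, relation=relation, label=escape_literal(label)
    )


query = instantiate("dbpedia-owl:birthdate", "dbpedia-owl:starring", "Forrest Gump")
print(query)
```

The same template could then serve other queries of the same shape, with only the three slots changing.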


Page 26: Microtask Crowdsourcing Applications for Linked Data

CROWDMAP & OTHERS


Page 27: Microtask Crowdsourcing Applications for Linked Data


CrowdMAP

• Experiments using MTurk, CrowdFlower and established benchmarks

• Enhancing the results of automatic techniques

• Fast, accurate, cost-effective [Sarasua, Simperl, Noy, ISWC2012]

Experiment             PRECISION   RECALL
CartP 301-304          0.53        1.0
100R50P Edas-Iasted    0.8         0.42
100R50P Ekaw-Iasted    1.0         0.7
100R50P Cmt-Ekaw       1.0         0.75
100R50P ConfOf-Ekaw    0.93        0.65
Imp 301-304            0.73        1.0
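How precision and recall of a crowdsourced alignment are computed against a reference alignment can be sketched as follows. The mappings are invented for illustration and are not taken from the CrowdMAP experiments.

```python
def precision_recall(found, reference):
    """Precision and recall of found mappings against a reference alignment."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical mappings between two conference ontologies.
crowd_mappings = {
    ("cmt:Paper", "ekaw:Paper"),
    ("cmt:Author", "ekaw:Paper_Author"),
    ("cmt:Review", "ekaw:Event"),  # a wrong mapping accepted by the crowd
}
reference_alignment = {
    ("cmt:Paper", "ekaw:Paper"),
    ("cmt:Author", "ekaw:Paper_Author"),
    ("cmt:Conference", "ekaw:Conference"),
    ("cmt:Person", "ekaw:Person"),
}

precision, recall = precision_recall(crowd_mappings, reference_alignment)
print(round(precision, 2), round(recall, 2))
```

Precision drops with every wrong mapping the crowd accepts, while recall drops with every reference mapping that was never shown to or confirmed by the crowd.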

Page 28: Microtask Crowdsourcing Applications for Linked Data

10.04.2023

Taste IT! Try IT!

• Restaurant review Android app developed in the Insemtives project

• Uses DBpedia concepts to generate structured reviews

• Uses mechanism design/gamification to configure incentives

• User study

  – 2274 reviews by 180 reviewers referring to 900 restaurants, using 5667 DBpedia concepts

https://play.google.com/store/apps/details?id=insemtives.android&hl=en

[Figure: bar chart by venue type (CAFE, FASTFOOD, PUB, RESTAURANT), scale 0 to 2500. Series: Number of reviews; Number of semantic annotations (type of cuisine); Number of semantic annotations (dishes).]

Page 29: Microtask Crowdsourcing Applications for Linked Data


LODrefine

http://research.zemanta.com/crowds-to-the-rescue/

Page 30: Microtask Crowdsourcing Applications for Linked Data


Ontology Population


Page 31: Microtask Crowdsourcing Applications for Linked Data


Linked Data Curation


Page 32: Microtask Crowdsourcing Applications for Linked Data


Problems and Challenges

• What is feasible and how can tasks be optimally translated into microtasks?

  – Examples: data quality assessment for technical and contextual features; subjective vs objective tasks (also in modeling); open-ended questions

• What to show to users

  – Natural language descriptions of Linked Data/SPARQL

  – How much context

  – What form of rendering

  – How about links?

• How to combine with automatic tools

  – Which results to validate

    • Low precision (no fun for gamers...)

    • Low recall (vs all possible questions)

• How to embed it into an existing application

  – Tasks are fine-grained and perceived as an additional burden on top of the actual functionality

• What to do with the resulting data?

  – Integration into existing practices

  – Vocabularies!

Page 33: Microtask Crowdsourcing Applications for Linked Data


Web site: https://sites.google.com/site/microtasktutorial/

SLIDES and EXERCISES: https://github.com/maribelacosta/crowdsourcing-tutorial

Full-day tutorial, ISWC 2013, Sydney, Australia

Page 34: Microtask Crowdsourcing Applications for Linked Data


For exercises, quiz and further material visit our website: http://www.euclid-project.eu

eBook, Course

Other channels: @euclid_project, euclidproject, euclidproject