Data Sets, Vocabularies and Tools

Pablo N. Mendes
Freie Universität Berlin

1st year review
Luxembourg, December 2011

11/02/11

Work Plan View WP4

(Gantt chart; time axis from 0 to 48 in steps of 6; partner labels shown on the chart: FUB, KIT, UPM)

Deliverables:
• D4.1 Assembly and maintenance of the PlanetData data set catalogue
• D4.2 Best practices on how to provide self-describing data
• D4.3 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
• D4.4 Data quality benchmark dataset
• D4.5 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal

Tasks:
• Task 4.1 Assembly and maintenance of the PlanetData data set catalogue
• Task 4.2 Community-driven creation and maintenance of vocabularies
• Task 4.3 Development of best practices for providing self-describing data
• Task 4.4 Assembly and maintenance of a catalogue of data provisioning tools

Work Plan View WP5

(Gantt chart; time axis from 0 to 48 in steps of 6; partner labels shown on the chart: EPFL, KIT)

Deliverables:
• D5.1 PlanetData data management tools catalogue and access portal
• D5.2 Best practices on how to deploy tools on large-scale infrastructures
• D5.3 PlanetData data management tools catalogue and access portal

Tasks:
• Task 5.1 Assembly and maintenance of PlanetData technology catalogue
• Task 5.2 Development of best practices of large-scale data management infrastructures

Summary

WP4

Assembly and maintenance of the PlanetData data set, vocabularies and tools catalogue;

Community-driven creation and maintenance of vocabularies;

Development of best practices;

WP5

Assembly and maintenance of the PlanetData technology catalogue;

Best practices for large-scale data management infrastructure;

Deliverables in Year 1

D4.1
• Data Sets Catalog
• Vocabularies Catalog

D5.1
• Data Management Tools Catalog

Data Sets Catalog

• Where to maintain the catalog?

• How to catalog?

• What to catalog?

• How to provide access for humans and machines?

• How to organize a community around the catalog?

Repository: TheDataHub.org

Maintained by the Open Knowledge Foundation (OKF) and the world-wide open data community

Widely used catalog
• Dec 1st 2012: 2418 datasets, 314 LOD

Features of the portal:
• Tagging, Rating, Feedback, Discussions, Groups

Cataloguing Process

• Planet Data Editor

• Collected a list of new datasets → 49 new entries

• Updated existing entries (537 edits)

• Crowdsourcing: data providers and third parties

• Public call for action to mailing lists, OKFN blog

• Supported the community contributions

• Quality Assurance

• Tools to support cataloguing (validator, auto-complete)

• Joint work with LATC

Catalog Metadata QuickRef

What?
• package name, title, url
• tag:lod
• topic
• shortname
• format-*

Who?
• author || maintainer
• published by producer
• provenance metadata
• license

When?
• version
• last updated

Why?
• package description

Where to find?
• example URI
• downloads/dumps
• SPARQL endpoint

How much?
• triples
• links:* (outlinks)
• namespace (inlinks)
• vocab mappings

http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation

How are datasets described?

Catalog Metadata

Resources:
• example URIs
• SPARQL endpoint
• RDF Dumps
• Sitemaps, VoID files

Cataloguing process overview

Catalog Entry Validator

Checks levels of metadata completeness

Step-by-step annotation instructions

Already checks some quality indicators, e.g. availability, provenance, access methods

http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php
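The idea of graded completeness checks can be illustrated with a minimal Python sketch. The level names, the fields required per level, the sample entry and the function `completeness_level` are all hypothetical and chosen for illustration only; they are not the rules implemented by the validator at the URL above. The field names loosely follow the QuickRef categories.

```python
# Hypothetical completeness levels; NOT the actual rules of the CKAN entry validator.
LEVELS = [
    ("minimal", ["name", "title", "url"]),
    ("described", ["description", "author_or_maintainer", "license"]),
    ("accessible", ["example_uri", "sparql_endpoint_or_dump"]),
    ("complete", ["triples", "links"]),
]


def missing_fields(entry, fields):
    """Return the fields that are absent or empty (None, '', 0, []) in the entry."""
    return [field for field in fields if not entry.get(field)]


def completeness_level(entry):
    """Return (highest level reached, fields still missing for the next level)."""
    reached = None
    for name, fields in LEVELS:
        missing = missing_fields(entry, fields)
        if missing:
            return reached, missing
        reached = name
    return reached, []


# Made-up catalog entry used only to exercise the check.
entry = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "url": "http://example.org/",
    "description": "A made-up entry for illustration.",
    "author_or_maintainer": "Jane Doe",
    "license": "CC-BY",
}

level, todo = completeness_level(entry)
print(level, todo)
```

Returning the missing fields together with the level is what would let a tool turn the result into step-by-step annotation instructions for the data publisher.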

CKAN Entry Validator (2)

Auto-completion scripts

For the entries that pass the validator, we can auto-complete metadata with information such as:
• Number of triples
• Links to other sources
• Vocabularies used
• Quality indicators
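As an illustration of the kind of statistics such a script could derive, the following sketch counts triples and outgoing links in an N-Triples dump. The function name `dataset_stats`, the prefix-based namespace test, the decision to ignore `rdf:type` objects when counting links, and the sample data are assumptions made for this example; they do not describe the project's actual scripts.

```python
from collections import Counter
from urllib.parse import urlparse

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def dataset_stats(ntriples_lines, namespace):
    """Count triples and outlinks (IRI objects outside `namespace`) per target host."""
    triples = 0
    outlinks = Counter()
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip malformed lines
        subject, predicate, rest = parts
        triples += 1
        if predicate == RDF_TYPE:
            continue  # assumption: class references count as vocabulary use, not links
        obj = rest.split(None, 1)[0]
        is_iri = obj.startswith("<") and obj.endswith(">")
        if (subject.startswith("<" + namespace) and is_iri
                and not obj.startswith("<" + namespace)):
            outlinks[urlparse(obj[1:-1]).netloc] += 1
    return {"triples": triples, "links": dict(outlinks)}


# Made-up sample data for a dataset published under http://example.org/
sample = [
    '<http://example.org/r/1> <http://www.w3.org/2002/07/owl#sameAs> <http://dbpedia.org/resource/Berlin> .',
    '<http://example.org/r/1> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .',
    '<http://example.org/r/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> .',
    '<http://example.org/r/2> <http://xmlns.com/foaf/0.1/knows> <http://example.org/r/1> .',
]
stats = dataset_stats(sample, "http://example.org/")
print(stats)
```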

Catalog Access Portal

For machines
• CKAN API (continuously improved by OKFN)
• VoID descriptions for LOD group (will be continuously improved in cooperation with LATC)

For humans
• LOD Cloud Diagram
• State of the LOD Report
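Machine access through the CKAN API could look roughly like the sketch below. It assumes the CKAN Action API (`/api/3/action/package_search`), which is newer than the API version available when these slides were written; the base URL `https://ckan.example.org` and the canned JSON response are placeholders, and no network request is made.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url, tag, rows=10):
    """Build a package_search URL that filters catalog entries by tag."""
    query = urlencode({"fq": f"tags:{tag}", "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"


def dataset_names(response_text):
    """Extract the package names from a package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an error")
    return [package["name"] for package in payload["result"]["results"]]


# Placeholder portal address; substitute the address of a real CKAN instance.
url = package_search_url("https://ckan.example.org/", "lod")

# Canned response with made-up package names, shaped like a package_search reply.
canned = (
    '{"success": true, "result": {"count": 2, "results": '
    '[{"name": "example-dataset-a"}, {"name": "example-dataset-b"}]}}'
)
names = dataset_names(canned)
print(url)
print(names)
```

In practice the URL would be fetched with an HTTP client such as `urllib.request.urlopen` and the response body passed to `dataset_names`.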

LOD Cloud Diagram

LOD Cloud Diagram (zoom in)

State of the LOD Cloud

(Pie charts: Triples by domain; Links by domain)

Domain                  # of datasets  Triples         %        (Out-)Links  %
Media                   25             1,841,852,061   5.82 %   50,440,705   10.01 %
Geographic              31             6,145,532,484   19.43 %  35,812,328   7.11 %
Government              49             13,315,009,400  42.09 %  19,343,519   3.84 %
Publications            87             2,950,720,693   9.33 %   139,925,218  27.76 %
Cross-domain            41             4,184,635,715   13.23 %  63,183,065   12.54 %
Life sciences           41             3,036,336,004   9.60 %   191,844,090  38.06 %
User-generated content  20             134,127,413     0.42 %   3,449,143    0.68 %
Total                   295            31,634,213,770           503,998,829

http://www4.wiwiss.fu-berlin.de/lodcloud/state/
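The percentage columns follow from the absolute counts by dividing each domain's figure by the total given in the last row. A short sketch using two rows of the table; the function and variable names are only illustrative.

```python
# Totals as stated in the last row of the table.
TOTAL_TRIPLES = 31_634_213_770
TOTAL_LINKS = 503_998_829


def share(count, total):
    """Percentage of the total, rounded to two decimals as in the table."""
    return round(100 * count / total, 2)


government_triples = share(13_315_009_400, TOTAL_TRIPLES)
life_sciences_links = share(191_844_090, TOTAL_LINKS)
print(government_triples, life_sciences_links)
```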

State of the LOD Cloud (2)

• SPARQL Endpoint: 68.14 %
• RDF Dumps: 39.66 %
• Provide provenance: 36.63 %
• Provide licensing: 17.84 %

(Chart: vocabulary use)

Vocabularies Catalog

• Based on BTC Dataset (2.1 billion triples)
• Shows vocabulary usage in practice
• Executed on a 54-node Hadoop cluster

• Access portal:
  • Searchable
  • URI Lookup
  • Top usage statistics

Hosted at http://vocab.cc
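Usage statistics of this kind can be thought of as a map step that emits one key per property or class occurrence and a reduce step that sums the keys. The sketch below mimics that pattern in a single Python process; it is not the vocab.cc pipeline, and the function names, the simple whitespace-based line parsing, and the sample N-Quads lines are assumptions for illustration.

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def map_line(line):
    """Map step: emit (kind, term) pairs for one N-Triples or N-Quads line."""
    parts = line.strip().split(None, 2)
    if len(parts) < 3 or parts[0].startswith("#"):
        return []  # blank, comment or malformed line
    _subject, predicate, rest = parts
    pairs = [("property", predicate.strip("<>"))]
    if predicate == RDF_TYPE:
        obj = rest.split(None, 1)[0]
        if obj.startswith("<"):
            pairs.append(("class", obj.strip("<>")))
    return pairs


def reduce_counts(pairs):
    """Reduce step: sum the occurrences of each (kind, term) key."""
    return Counter(pairs)


def usage(lines):
    """Run the map and reduce steps over an iterable of lines."""
    return reduce_counts(pair for line in lines for pair in map_line(line))


# Made-up sample quads.
sample = [
    '<http://example.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://example.org/graph> .',
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/graph> .',
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" <http://example.org/graph> .',
]
counts = usage(sample)
print(counts.most_common(1))
```

Because each line is mapped independently and counts are merged by summation, the same two functions could be distributed over many workers, which is the property a Hadoop job relies on.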

Top Classes per Dataset

Top Properties per Dataset

Vocabularies Catalog

vocab.cc search query results

vocab.cc URI Lookup Results

Tools Catalog

• Initial focus on tools from the consortium

• Currently 15 tools

Entry for Global Sensor Networks (GSN)

Available from planet-data.eu

Tools Description

• Textual description
  • What is it?
  • Documentation
  • Publications
  • Requirements
  • License
  • Contact person/mailing list
  • Organization
  • Events

• Tags
  • Produce
  • Publish
  • Consume
  • Provisioning

Names of Tools in the Catalog

CumulusRDF

D2R

DBpedia Spotlight

GSN (Global Sensor Networks)

Geometry2RDF

LDIF

LDSpider (Linked Data Spider)

LarKC (Large Knowledge Collider)

MonetDB

NOR2O

R2O&ODEMapster

OKKAM

Pubby

R2R

S2O

Silk

Tools Catalog

Related: LATC Tools Catalog
• 11 tools
• 5 tools in both, 10 new tools in PlanetData

Proposal for next year:
• Join catalogs at linkeddata.org
• Jointly maintain catalog until LATC finishes
• Build a community → people can add their own tools
• Afterwards PlanetData takes over and maintains the catalog for another 2 years