Development of guidelines for publishing statistical data as linked open data MERGING STATISTICS AND GEOSPATIAL INFORMATION IN MEMBER STATES - POLAND Mirosław Migacz INSPIRE Conference 2016 Barcelona, 26 IX 16


Agenda

• project aims,

• introduction to linked open data,

• project timeline,

• project tasks,

• intranet site,

• from ontology to SPARQL endpoint.

Overall objective

Support decision-making processes involving provision of standardized, usable and open georeferenced statistical data.

What is linked open data?

• Internet – a collection of documents published online, each accessible at a Web location identified by a URL.

• These documents are mainly human-readable and cannot be understood by machines.

• Linked open data – data published in machine-readable formats, described and connected using Uniform Resource Identifiers (URIs), thus enabling people and machines to collect the data and put it together to do all kinds of things with it (as permitted by the licence).

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Linked open data

• URI – for names

• RDF – to describe data

• SPARQL – to query for data

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Uniform Resource Identifier (URI)

To "make a long story short": an object described by an internet address.

A country, e.g. Belgium

http://publications.europa.eu/resource/authority/country/BEL

A dataset, e.g. Countries Named Authority List

http://publications.europa.eu/resource/authority/country/

In official statistics it can look like this:

http://teryt.stat.gov.pl/32/18/05/3 - gmina Węgorzyno

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

RDF and SPARQL

Resource Description Framework (RDF) is a syntax for representing data and resources on the Web.

RDF breaks every piece of information down into triples:

• Subject – a resource, which may be identified with a URI.

• Predicate – a URI-identified reused specification of the relationship.

• Object – a resource or literal to which the subject is related.

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Example (subject – predicate – object):

http://example.org/place/Brussels is the capital of "Belgium" (object as a literal)

or

http://example.org/place/Brussels is the capital of http://example.org/place/Belgium (object as a resource)

SPARQL is a standardised language for querying RDF data.
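
The triple model and the idea of querying it can be sketched in a few lines of plain Python. This is only an illustration, not an RDF library: the predicate URI and the second triple (Warsaw / Poland) are hypothetical, and the `match` helper merely mimics what a single SPARQL triple pattern does.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The predicate URI is hypothetical; the Brussels/Belgium URIs come from the slide.
CAPITAL_OF = "http://example.org/ontology/isCapitalOf"
BELGIUM = "http://example.org/place/Belgium"

triples = [
    ("http://example.org/place/Brussels", CAPITAL_OF, BELGIUM),
    # Hypothetical second statement, added only to make the query selective.
    ("http://example.org/place/Warsaw", CAPITAL_OF, "http://example.org/place/Poland"),
]


def match(pattern, data):
    """Return the triples that fit a pattern; None in the pattern is a wildcard."""
    return [
        triple
        for triple in data
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


# Roughly what this SPARQL query asks for:
#   SELECT ?city WHERE { ?city <.../isCapitalOf> <http://example.org/place/Belgium> }
capitals_of_belgium = [s for s, _, _ in match((None, CAPITAL_OF, BELGIUM), triples)]
print(capitals_of_belgium)
```

A real SPARQL engine joins many such patterns and works on typed literals; the sketch only shows the wildcard matching at the core of a query.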

Five stars of linked open data

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Make your stuff available on the Web (whatever format) under an open license.

Make it available as structured data (e.g., Excel instead of image scan of a table)

Use non-proprietary formats (e.g., CSV instead of Excel)

Use URIs to denote things, so that people can point at your stuff

Link your data to other data to provide context

Now

powiat łobeski (LAU 1)

3218

4.4.32.64.18

lobeski

4326418

Aim

powiat łobeski

http://nts.stat.gov.pl/4/4/32/64/18

Specific objectives

• identification of statistical units for which data can be published, with harmonization of their geometries for respective years

• building standardized URIs for statistical units

• identification and analysis of potential data sources

• plan for transformation of existing data sources into open formats

• creation of RDF metadata for data sources

• feasibility analysis for publishing linked open data

Stage I – until 4/10/2016

• identification of statistical units for which data can be published, with harmonization of their geometries for respective years

• building standardized URIs for statistical units

• identification and analysis of potential data sources, analyzing for: "openness", georeference, verifying the need for geocoding

5 GUS-PK

2 GUS-DI

1 GUS-AZ

3 US Poznań

2 US Olsztyn

1 US Wrocław / OBDL J. Góra

Stage II – until 7/10/2017

• plan for transformation of existing data sources into open formats

• creation of RDF metadata for data sources

• feasibility analysis for publishing linked open data (building a SPARQL endpoint)

5 GUS-PK

1 GUS-AZ

3 US Poznań

2 US Olsztyn

1 US Wrocław / OBDL J. Góra

Identification of data sources

• Three major databases:

    • Local Data Bank

        • biggest set of statistical information available for a wide range of years

        • updated monthly

    • Demography Database

        • integrated data source for state and structure of population, vital statistics and migrations

    • Development monitoring system STRATEG

        • a system for facilitating and monitoring the development policy

        • key measures to monitor execution of strategies at local, regional, transregional and EU level

Identification of data sources

• Other data sources:

    • publications

    • tables

    • communiques

    • announcements

    • articles

Identification of data sources

• Metadata:

    • thematic category,

    • format (PDF, DOC, XLS, CSV),

    • spatial reference (country, NUTS, LAU, functional areas, urban areas),

    • temporal reference (years),

    • presence of identifiers (TERYT, NTS, NUTS),

    • update cycle.

Preliminary analysis of data sources

• Key aspects:

    • openness

    • redundancy of information

    • popularity (based on view and download statistics)

• Inclusion / exclusion of the data source

Statistical units harmonization

• Basis:

    • NTS (Nomenclature of Territorial Units for Statistical Purposes)

Name        NTS level  NUTS/LAU  Identifier
Region      1          NUTS 1    1.6
Voivodship  2          NUTS 2    2.6.22
Subregion   3          NUTS 3    3.6.22.40
Powiat      4          LAU 1     4.6.22.40.11
Gmina       5          LAU 2     5.6.22.40.11.01.1

Statistical units harmonization

• Input data:

    • administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007

• Harmonization process:

    • structure standardization

    • standardization of identifiers (creating NTS identifiers)

    • aggregation to higher-level units (LAU 1 -> NUTS 1)
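
The aggregation step relies on the NTS identifier of a unit already containing the codes of its parent units. A minimal sketch of that idea follows, assuming the identifier pattern from the NTS table (level digit followed by the codes of all enclosing units). The function names, the two extra gmina identifiers and the numeric values are hypothetical, and the same grouping key could equally be used to dissolve geometries rather than sum figures.

```python
from collections import defaultdict


def nts_ancestors(identifier: str) -> list[str]:
    """Derive parent NTS identifiers (levels 4..1) from a lower-level identifier.

    Assumes the pattern seen in the NTS table: a level n unit (n = 1..4)
    is written as 'n' followed by the first n codes of the hierarchy.
    """
    parts = identifier.split(".")
    level, codes = int(parts[0]), parts[1:]
    return [
        ".".join([str(parent)] + codes[:parent])
        for parent in range(min(level - 1, 4), 0, -1)
    ]


def aggregate(values: dict[str, int], target_level: int) -> dict[str, int]:
    """Sum values of lower-level units into their parents at target_level."""
    totals: dict[str, int] = defaultdict(int)
    for identifier, value in values.items():
        for ancestor in nts_ancestors(identifier):
            if ancestor.startswith(f"{target_level}."):
                totals[ancestor] += value
    return dict(totals)


# Hypothetical figures; only the first identifier appears in the source table.
gmina_values = {
    "5.6.22.40.11.01.1": 100,
    "5.6.22.40.11.02.2": 50,
    "5.6.22.40.12.01.1": 7,
}
print(nts_ancestors("5.6.22.40.11.01.1"))
print(aggregate(gmina_values, 4))
```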

Statistical units harmonization

• Non-standard statistical units:

    • functional areas

    • urban areas

• Groups of NTS units

• Derive mostly from strategic documents

• Changes of geometries over time still to be determined

Statistical units URIs

• NTS as basic classification

Name        NTS level  NUTS/LAU  Identifier         URI (http://nts.stat.gov.pl/...)
Region      1          NUTS 1    1.6                …1/6
Voivodship  2          NUTS 2    2.6.22             …2/6/22
Subregion   3          NUTS 3    3.6.22.40          …3/6/22/40
Powiat      4          LAU 1     4.6.22.40.11       …4/6/22/40/11
Gmina       5          LAU 2     5.6.22.40.11.01.1  …5/6/22/40/11/01/1

http://nts.stat.gov.pl/5/6/22/40/11/01/1
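
The mapping from a dotted NTS identifier to its URI is mechanical, as the table suggests: dots become path separators under the `http://nts.stat.gov.pl/` base. A minimal sketch, with hypothetical function names and an assumed validation rule (all segments numeric):

```python
NTS_BASE = "http://nts.stat.gov.pl/"


def nts_uri(identifier: str) -> str:
    """Turn a dotted NTS identifier into a URI, e.g. '1.6' -> '.../1/6'."""
    segments = identifier.strip().split(".")
    # Assumed rule: every segment is a non-empty run of digits.
    if not all(segment.isdigit() for segment in segments):
        raise ValueError(f"not a dotted numeric NTS identifier: {identifier!r}")
    return NTS_BASE + "/".join(segments)


def nts_identifier(uri: str) -> str:
    """Inverse mapping: recover the dotted identifier from a URI."""
    if not uri.startswith(NTS_BASE):
        raise ValueError(f"not an NTS URI: {uri!r}")
    return uri[len(NTS_BASE):].strip("/").replace("/", ".")


print(nts_uri("5.6.22.40.11.01.1"))
print(nts_identifier("http://nts.stat.gov.pl/4/4/32/64/18"))
```

Leading zeros (such as the `01` segment) are kept as strings on purpose, since converting segments to integers would change the identifier.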

Data transformation plan

• From ontology to SPARQL endpoint

    • Decide what will be published as Open Data

        • three major databases

        • other data sources

    • Create ontology

    • Map to existing databases

    • Export to RDF data store

    • Publish on linked data server

• Workflow tested on STRATEG database
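
To make the "export to RDF" step more concrete, the sketch below serializes one tabular record as N-Triples lines by hand. It is not the Ontop-based workflow tested in the project; the row layout, the `value` predicate and the figure 123 are hypothetical, while `rdfs:label` and `xsd:integer` are standard vocabulary URIs.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
# Hypothetical predicate, standing in for a term from the project ontology.
VALUE = "http://example.org/ontology/value"


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_ntriples(row: dict) -> list[str]:
    """Serialize one record as two N-Triples statements about its unit URI."""
    subject = "<" + row["unit_uri"] + ">"
    label = escape_literal(row["name"])
    value = int(row["value"])
    return [
        f'{subject} <{RDFS_LABEL}> "{label}" .',
        f'{subject} <{VALUE}> "{value}"^^<{XSD_INTEGER}> .',
    ]


# Hypothetical record; the URI and unit name come from the slides.
row = {"unit_uri": "http://nts.stat.gov.pl/4/4/32/64/18", "name": "powiat łobeski", "value": 123}
for line in row_to_ntriples(row):
    print(line)
```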

Ontology - methods and tools

• Ontop – a platform to query databases as Virtual RDF Graphs using SPARQL

    • SPARQL 1.0 support

    • Supports interface for ontology development

    • Intuitive/powerful mapping language

    • Support for free and commercial DBMS

    • SPARQL endpoint

Mapping ontology on database

SPARQL query on mapped data

SPARQL endpoint tools for the web

• Apache Jena Fuseki

    • Fuseki is a SPARQL server. It allows REST-style SPARQL queries.

    • RDF data generated by Ontop is imported into Apache Jena

• Pubby

    • A Linked Data frontend for SPARQL endpoints

    • Pubby makes it easy to turn a SPARQL endpoint into a Linked Data server. It is implemented as a Java web application.

    • Provides data at a given linked data URI
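
A "REST-style" SPARQL query means the query text travels as the `query` parameter of an HTTP request, as defined by the SPARQL Protocol. The sketch below only builds such a GET URL and sends nothing; the endpoint address (host, port and the dataset name `stat`) is an assumption, not the project's actual server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed local Fuseki endpoint; the dataset name "stat" is hypothetical.
ENDPOINT = "http://localhost:3030/stat/sparql"

QUERY = """SELECT ?p ?o
WHERE { <http://nts.stat.gov.pl/4/4/32/64/18> ?p ?o }
LIMIT 10"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build a SPARQL Protocol GET URL with the query percent-encoded."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

To run the query, the URL could be fetched with any HTTP client, typically with an `Accept: application/sparql-results+json` header to request JSON results.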

Fuseki SPARQL endpoint query

Query result facilitated by Pubby

Further work

• Consultation of the designed workflow during a study visit at the Madrid University of Technology

• Setting up an internal test linked data server to implement web tools

• Creating ontologies and workflows for databases and other data sources

Summary – results so far

• Harmonized geometries for statistical units

• Identified data sources with comprehensive metadata

• Preliminary data transformation plan with tools tested

Poland’s data opening strategy

• launched this year

• aimed at opening data resources of government institutions in line with the goals of the five stars of linked open data

• the grant results (guidelines) are in line with the strategy

• increased probability of acquiring financing for a fully fledged implementation

INSPIRE Thematic Clusters

https://themes.jrc.ec.europa.eu – collaboration platform

Statistical Cluster:

statistical units

population distribution (demography)

human health and safety

Informal meeting of Cluster members during the coffee break (15:30-16:00)

Questions?