Linked Census Data

DANS is an institute of KNAW and NWO

Data Archiving and Networked Services

Linked Census Data: Semantics for Knowledge Discovery of the Past

Albert Meroño-Peñuela

01/03/2013


Main goal: cross queries

Main goal: requirements

• Schema flexibility: do not commit to a specific schema

• Linkage
  – Internally (e.g. between tables), to make relations explicit
  – Externally
    • Harmonization datasets (e.g. HISCO, AC)
    • Enriching datasets (e.g. labour strikes, book publications)

• Inference of new knowledge (e.g. ink_manufacturer(X) ∧ ink_manufacturer ⊑ chemical ⊨ chemical(X))

• Publication: as open data for researchers on the Web (through Service Architectures)
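The inference requirement above can be illustrated with a minimal sketch. It assumes the simplest case, a subclass hierarchy stored in a plain dictionary; the names `subclass_of`, `asserted` and `infer_types` are hypothetical and are not part of any CEDAR tool. A real deployment would more likely rely on an RDFS/OWL reasoner.

```python
# Minimal sketch of subclass inference (illustrative data, standard library only).

# ink_manufacturer is a subclass of chemical (assumed hierarchy, taken from the slide's example)
subclass_of = {"ink_manufacturer": "chemical"}

# Asserted fact: ink_manufacturer(X)
asserted = {("X", "ink_manufacturer")}


def infer_types(asserted, subclass_of):
    """Return the asserted (individual, class) pairs plus those implied by subclass links."""
    inferred = set(asserted)
    for individual, cls in asserted:
        seen = {cls}
        parent = subclass_of.get(cls)
        # Walk up the hierarchy; 'seen' guards against accidental cycles.
        while parent is not None and parent not in seen:
            inferred.add((individual, parent))
            seen.add(parent)
            parent = subclass_of.get(parent)
    return inferred


facts = infer_types(asserted, subclass_of)
print(sorted(facts))
```

The new pair ("X", "chemical") is never stated in the data; it follows from the class hierarchy alone.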

Main goal: RDF datamodel

CEDAR development cycle, iteration 1

• Gathering: only one file
• Conversion: TabLinker, small table size
• Querying: simple, ad-hoc SPARQL + trivial visualization

Iteration 1: conversion

https://github.com/Data2Semantics/TabLinker

• Supervised Excel to RDF conversion
• Python feat. xlutils, xlrd, rdflib libs
• Intended for complex layouts that cannot be handled with automatic csv2rdf scripts
• Maps workbooks to the RDF Data Cube vocabulary
• Layout needs to be manually annotated
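To make the idea of mapping a workbook cell to a Data Cube observation concrete, here is a rough sketch. It is not TabLinker's code: it uses only the standard library and writes N-Triples by hand. The `http://example.org/census/` namespace, the `dimension` and `populationSize` property names, the function `cell_to_ntriples`, and the sample cell value are all assumptions made for illustration. Only the `qb:Observation` class comes from the RDF Data Cube vocabulary.

```python
from urllib.parse import quote

QB = "http://purl.org/linked-data/cube#"          # RDF Data Cube namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"
BASE = "http://example.org/census/"               # hypothetical namespace


def cell_to_ntriples(sheet, row, col, dimensions, value):
    """Serialize one data cell as N-Triples lines describing a qb:Observation.

    'dimensions' holds the header labels a human annotator attached to the cell.
    """
    obs = f"<{BASE}{quote(sheet)}/R{row}C{col}>"
    lines = [f"{obs} <{RDF_TYPE}> <{QB}Observation> ."]
    for dim in dimensions:
        lines.append(f"{obs} <{BASE}dimension> <{BASE}{quote(dim)}> .")
    lines.append(f'{obs} <{BASE}populationSize> "{int(value)}"^^<{XSD_INT}> .')
    return lines


# Made-up cell: sheet name, position, header labels and value are illustrative.
triples = cell_to_ntriples("Eerste gedeelte", 12, 4, ["Totaal", "M"], 1530)
print("\n".join(triples))
```

The manual annotation step matters because the list of header labels for each cell cannot be guessed reliably from a complex layout.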

Iteration 1: querying

PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size WHERE {
  ?cell d2s:isObservation [
    d2s:dimension d2sdata:Totaal ;
    d2s:dimension d2sdata:M_ ;
    ns2:Buiten_de_kom ?place ;
    d2s:populationSize ?size
  ] .
  ?place skos:prefLabel "TOT"@nl .
}
ORDER BY DESC(?size)

Iteration 1: querying

http://cedar-project.nl/visualizing-sparql-query-results-on-the-census/

Iteration 1: outcome

CEDAR development cycle, iteration 2

• Gathering: arbitrary number of files
• But, what do we have?
• Conversion: arbitrary table size, annotations
• Querying: SPARQL with mappings, top level ontologies

Iteration 2: gathering

Hey, what’s there?

Inventory of the dataset
• How many files do we have?
• How many tables/sheets?
• How many variables?
• How many annotations?

TabExtractor (Python feat. xlrd, Levenshtein libs)

https://github.com/CEDAR-project/TabExtractor

https://www.dropbox.com/s/ah7lgmji2ofat3w/Census%20summary.xls

https://www.dropbox.com/s/vw1rf4pp8g8sxn3/annotations-dump-translation.csv

Year  File                 Table     Row    Col  Author
1899  VT_1899_06_H5.xls    Utrecht   155    3    Vreugdenhil
1899  VT_1899_06_H5.xls    Utrecht   805    3    Vreugdenhil
1930  WT_1930_04_A-T2.xls  Tabel 2a  0      0    Helpdesk
1930  WT_1930_04_A-T2.xls  Tabel 2b  0      0    Th. Vreugdenhil
1909  VT_1909_01_T.xls     Tabel 1   10058  13   DFS 7
1909  VT_1909_01_T.xls     Tabel 1   3321   15   ServiceProfs 001
1909  VT_1909_01_T.xls     Tabel 1   11909  13   DFS 7
1909  VT_1909_01_T.xls     Tabel 1   12596  11   DFS 8

• 507 Excel files
• 2,288 tables
• 33,283 annotated cells
  – 10.95% numerical corrections
  – 89.05% textual descriptions / anomalies
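The inventory questions (how many files, tables and annotations) come down to simple tallies over the annotations dump. The sketch below is not TabExtractor itself. It assumes the dump has already been read into tuples of (year, file, table, row, col, author) and uses the eight sample rows from the excerpt as input.

```python
from collections import Counter

# Sample rows from the annotations dump excerpt:
# (year, file, table, row, col, author)
annotations = [
    (1899, "VT_1899_06_H5.xls", "Utrecht", 155, 3, "Vreugdenhil"),
    (1899, "VT_1899_06_H5.xls", "Utrecht", 805, 3, "Vreugdenhil"),
    (1930, "WT_1930_04_A-T2.xls", "Tabel 2a", 0, 0, "Helpdesk"),
    (1930, "WT_1930_04_A-T2.xls", "Tabel 2b", 0, 0, "Th. Vreugdenhil"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 10058, 13, "DFS 7"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 3321, 15, "ServiceProfs 001"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 11909, 13, "DFS 7"),
    (1909, "VT_1909_01_T.xls", "Tabel 1", 12596, 11, "DFS 8"),
]

# Distinct workbooks and distinct (workbook, sheet) pairs that carry annotations.
files = {a[1] for a in annotations}
tables = {(a[1], a[2]) for a in annotations}

# Number of annotated cells per census year.
per_year = Counter(a[0] for a in annotations)

print(len(files), "files,", len(tables), "tables,", len(annotations), "annotated cells")
print(dict(per_year))
```

Run over the full dump, the same tallies would yield the totals reported on the slide.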

But TabExtractor ain’t a sexy thing…
• Bring metadata together
• Publish on the Web? Archive?

Subset of the dataset
• Miniproject 1
  – 1889
  – Occupational census
  – Province Noord-Brabant
  – 1 table
• Miniproject 2
  – 1859, 1869, 1879, 1889
  – Population census
  – Province Noord-Brabant
  – 4 tables

• Iteration 1 converted only the Excel cells to RDF
• Some cells have annotations attached
  – Value corrections: 5 → 8
  – Explanations, descriptions: "Number includes 2 people of unknown age"
  – Inconsistencies: "Sum does not add up"
• Iteration 2 produces proper named graphs for annotations

Annotations data model

Iteration 2: data quality

• Annotations can improve data quality
• Model has to be extended with actions
  – If sum doesn’t add up → retrieve numbers from other tables/sources
  – Appropriate vocabularies

Iteration 2: data quality

• Measure of data quality? Benford’s Law
  – Data distributions in censuses follow Benford’s Law
  – Demo available!
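Benford's Law states that the leading digit d of many naturally occurring numbers appears with probability log10(1 + 1/d). A simple check compares observed first-digit frequencies with that expectation. The sketch below is one possible way to do it and is not the demo mentioned on the slide. The function names and the mean-absolute-deviation score are choices made here, and the input is synthetic (powers of two), not census data.

```python
import math
from collections import Counter


def benford_expected(d):
    """Expected relative frequency of leading digit d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)


def leading_digit(n):
    """First decimal digit of a non-zero integer."""
    return int(str(abs(int(n)))[0])


def benford_deviation(values):
    """Mean absolute deviation between observed and expected first-digit frequencies.

    Zeros are skipped; at least one non-zero value is assumed.
    """
    digits = [leading_digit(v) for v in values if int(v) != 0]
    counts = Counter(digits)
    total = len(digits)
    return sum(
        abs(counts.get(d, 0) / total - benford_expected(d)) for d in range(1, 10)
    ) / 9


# Synthetic input: powers of two are known to follow Benford's Law closely.
sample = [2 ** k for k in range(1, 200)]
print(round(benford_deviation(sample), 4))
```

A lower score means a closer fit. Columns of census counts that score unusually high compared with the rest of the dataset would be candidates for a closer look.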

Iteration 2: querying

Prefixes used for the 1889 table:

PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1889_12_H1_marked/Eerste_gedeelte/>
PREFIX ns2: <http://www.data2semantics.org/core/Eerste_gedeelte/Kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

The same query against the 1879 table needs different prefixes, property names and labels:

PREFIX d2s: <http://www.data2semantics.org/core/>
PREFIX d2sdata: <http://www.data2semantics.org/data/VT_1879_10_H1_marked/NOORD-BRABANT/>
PREFIX ns2: <http://www.data2semantics.org/core/Kom-buiten-de-kom/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?place ?size WHERE {
  ?cell d2s:isObservation [
    d2s:dimension d2sdata:Totaal ;
    d2s:dimension d2sdata:M ;
    ns2:Kom_Buiten_de_kom ?place ;
    d2s:populationSize ?size
  ] .
  ?place skos:prefLabel "Totaal in de gemeente"@nl .
}
ORDER BY DESC(?size)


Iteration 2: querying

• Things to be mapped
  – Occupations (HISCO)
  – Municipalities (Amsterdamse Code)
  – Housing types
  – Religions
  – Etc.
• Converted the HISCO and AC mappings to RDF (https://github.com/CEDAR-project/Harmonize)
  – Linked to the tables RDF
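The reason for mapping to a shared classification is that one question can then be asked across census years even when each year's table uses its own labels. The sketch below shows the idea with a plain dictionary. It is not the Harmonize code, and every label and count in it is an illustrative placeholder; only the HISCO code 21110 appears elsewhere in this deck.

```python
# (year, source label) -> harmonized code. All entries are illustrative placeholders.
mapping = {
    (1879, "occupation label A"): "21110",
    (1889, "occupation label B"): "21110",
}

# Made-up observations, as they might come out of two differently labelled tables.
observations = [
    {"year": 1879, "label": "occupation label A", "size": 10},
    {"year": 1889, "label": "occupation label B", "size": 12},
    {"year": 1889, "label": "unmapped label", "size": 3},
]


def totals_by_code(observations, mapping):
    """Sum observation sizes per (harmonized code, year); unmapped labels are skipped."""
    totals = {}
    for obs in observations:
        code = mapping.get((obs["year"], obs["label"]))
        if code is None:
            continue
        key = (code, obs["year"])
        totals[key] = totals.get(key, 0) + obs["size"]
    return totals


totals = totals_by_code(observations, mapping)
print(totals)
```

In the RDF setting, the dictionary corresponds to links from table-specific URIs to HISCO or AC resources, and the lookup becomes an extra triple pattern in the SPARQL query.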

Iteration 2: linking HISCO

Iteration 2: linking AC

Iteration 2: linking

• Issue: HISCO is too generic (top-down approach)
  – Class 21110 too abstract: General Manager
  – Visualization of SPARQL HISCO mappings
• Issue: AC works at the municipality level
  – Other geographical harmonizations?
• Need for year-level ontologies
  – Classification systems are different
• R script for a bottom-up approach: classification extractor (https://github.com/albertmeronyo/OccupationOntology)
  – Automated removal of non-related cols and rows
  – Introduction of redundancy (‘Id.’ values)
  – Removal of totals
  – Work in progress: ontology merging
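Two of the extraction steps listed above, expanding 'Id.' (ditto) values and removing totals, can be sketched briefly. The original tool is an R script; this Python version only illustrates the idea and is not a port of it. It assumes a rectangular table of strings whose total rows contain the word "totaal". The function names and sample rows are hypothetical, and running total removal before ditto expansion is a choice made here so that a ditto never copies from a total row.

```python
def drop_totals(rows, marker="totaal"):
    """Remove rows in which any cell mentions the totals marker (case-insensitive)."""
    return [
        row for row in rows
        if not any(marker in str(cell).lower() for cell in row)
    ]


def fill_ditto(rows, ditto="Id."):
    """Replace ditto marks with the value from the same column in the row above.

    Assumes all rows have the same number of columns.
    """
    filled = []
    for row in rows:
        if filled:
            above = filled[-1]
            row = [above[i] if cell == ditto else cell for i, cell in enumerate(row)]
        filled.append(list(row))
    return filled


# Made-up occupation table: class, group, occupation.
rows = [
    ["Class I", "Group a", "occupation 1"],
    ["Id.", "Id.", "occupation 2"],
    ["Id.", "Group b", "occupation 3"],
    ["Totaal", "", ""],
]

hierarchy = fill_ditto(drop_totals(rows))
for row in hierarchy:
    print(row)
```

After this step every row carries its full class path, which is the redundancy needed to read a class hierarchy straight off the table.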

Upper ontologies (HISCO, AC)

Year-dependent ontologies

Concept drift

• Models drift over time
• Classes merge, split, change their properties (beroepenklassen, i.e. occupational classes)
• Although, some core meaning remains (shoemakers)
• Can we automatically identify and align drifted models?

[Figure: models at times t1, t2, …, tn, with question marks on the alignments between them]
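As a naive baseline for the alignment question, class labels from two census years can be compared by string similarity. This is only a sketch of one possible starting point and not a method proposed in the deck. It uses `difflib.SequenceMatcher` from the standard library, the threshold of 0.8 is arbitrary, and the labels are made up. Real drift also involves merges, splits and changed properties, none of which label matching can detect.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def align_labels(old_labels, new_labels, threshold=0.8):
    """Map each old label to its most similar new label, or None below the threshold."""
    alignment = {}
    for old in old_labels:
        best = max(new_labels, key=lambda new: similarity(old, new), default=None)
        if best is not None and similarity(old, best) >= threshold:
            alignment[old] = best
        else:
            alignment[old] = None
    return alignment


# Made-up class labels for two points in time (t1 and t2).
labels_t1 = ["shoemakers", "ink manufacturers"]
labels_t2 = ["shoemaker", "chemical workers"]

alignment = align_labels(labels_t1, labels_t2)
print(alignment)
```

Labels left unaligned are the interesting cases: they may have been renamed, merged into a broader class, or dropped.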

Conclusion: milestones

• Complete inventory of the dataset (w/ metadata generation)

• Translation to RDF
  – Raw data
  – Annotations
  – Harmonization/linking

• Successful data quality experiments (Benford’s Law)
• Useful software
  – TabLinker (Excel/CSV to RDF)
  – TabExtractor (Excel/CSV metadata collector)
  – Harmonize (HISCO/AC to Census linker)
  – OccupationOntology (bottom-up occupation ontology extractor)

Conclusion: future work

• Better software
  – TabLinker: automate mark-up process
  – TabExtractor: improve and publish inventory output
  – Harmonize: improve HISCO/AC datamodels
  – OccupationOntology: extend to housing types, religions, etc.

• Concept drift literature on drifting models (Kuukkanen 2008, Gonçalves et al. 2009, Shenghui et al. 2010)

• Semantic Web literature on modeling geographical change (Kauppinen 2010)

– Integrate with AC dataset?

• Link meaningful datasets with the census
  – Labour strikes
  – Book publications
  – More?

Thank you

http://www.cedar-project.nl

albert.merono@dans.knaw.nl
