Linked Census Data

Linked Census Data, Rinke Hoekstra, CEDAR Kickoff, 26 January 2012

description

A talk on the use of Linked Data in historical research on census data, with some slides about TabLinker (http://github.com/Data2Semantics/TabLinker). Part of the Data2Semantics project (http://data2semantics.org).

Transcript of Linked Census Data

Page 1: Linked Census Data

Linked Census Data

Rinke Hoekstra

CEDAR Kickoff, 26 January 2012

Thursday, 26 January 2012

Page 2: Linked Census Data

Overview

Problem

Procedure (as I understand it)

Step-by-step

Vocabularies, tools

Conclusion

“Can Linked Data make a difference for historical analysis?”


Page 3: Linked Census Data

Problem

~519 Excel spreadsheets (more? ... I heard 1200)

Want to do analysis over time and space, but...

Structure

Excel sheets cannot be readily imported in a database

Contents

Excel sheets are not normalised (age) nor harmonised (occupations/places)

Excel sheets contain errors (both original and data-entry)

Want to preserve all stages of data cleansing/harmonisation


Page 4: Linked Census Data

Procedure

Archiving: Verbatim import of sheets to database/triple store

Correcting/Interpreting: Add missing information (headers); add corrected information (data)

Normalising: Interpret and correct objective information

Harmonising: Link information across sheets; link information to other datasets (e.g. locations)

Visualising: Build (generic) visualisations of results

Documenting

Page 5: Linked Census Data

... a bit about Linked Data

“Just another Data Model”

RDF ≠ Ontology (OWL)

RDF ≠ Taxonomy (RDFS/SKOS)

Globally Unique Identifiers (URI) for all entities

Dereferenceable on the Web (URI = URL)

HTTP-accessible databases (triple stores, SPARQL)

Triples all the way <subject,  predicate,  object>
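As a minimal sketch of "triples all the way", the snippet below stores a few statements as plain Python tuples and answers pattern queries over them. The example.org namespace, the property names and the `match` helper are assumptions made for illustration; a real setup would keep the statements in a triple store and query them with SPARQL.

```python
# Hypothetical namespace; any HTTP URI under your control would do.
EX = "http://example.org/census/"

# Each statement is a (subject, predicate, object) tuple.
TRIPLES = {
    (EX + "obs1", EX + "populationSize", 1),
    (EX + "obs1", EX + "occupation", EX + "pannenbakkers"),
    (EX + "pannenbakkers", EX + "broader", EX + "I/E"),
}


def match(pattern, store=TRIPLES):
    """Return the triples that fit a pattern; None is a wildcard."""
    hits = [
        triple for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]
    return sorted(hits, key=str)


# Everything said about one subject:
about_obs1 = match((EX + "obs1", None, None))
```

Because subjects, predicates and most objects are URIs, statements from different sources can be put in the same set without their names clashing.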


Page 6: Linked Census Data

Spreadsheet ≠ Database

Primary Keys are entities

Column names are attributes

Cell values are attribute values

Secondary keys are relations to other entities
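The mapping on this slide can be sketched in a few lines of Python. The table, the column names and the example.org URIs are invented for illustration, and "secondary keys" are read here as columns that refer to rows in another table.

```python
BASE = "http://example.org/"  # hypothetical namespace


def row_to_triples(table, key_column, row, secondary_keys=None):
    """Turn one database row (a dict) into (subject, predicate, object) triples."""
    secondary_keys = secondary_keys or {}
    subject = f"{BASE}{table}/{row[key_column]}"  # primary key -> entity
    triples = []
    for column, value in row.items():
        if column == key_column:
            continue
        predicate = f"{BASE}{table}#{column}"  # column name -> attribute
        if column in secondary_keys:
            # secondary key -> relation to an entity in another table
            value = f"{BASE}{secondary_keys[column]}/{value}"
        triples.append((subject, predicate, value))
    return triples


person = {"id": 7, "name": "Jan", "municipality_id": 42}
triples = row_to_triples("person", "id", person,
                         secondary_keys={"municipality_id": "municipality"})
```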


Page 9: Linked Census Data

Spreadsheet ≠ Database

No Primary Keys!

Anything can be an entity

Column headers are “types”

Row headers are “types”

Hierarchies!

Cell values are entity “values”

No relations to other entities


Page 10: Linked Census Data

Anatomy of a Spreadsheet

[Figure: a Workbook contains Sheets, and each Sheet contains a grid of Cells.]

Page 11: Linked Census Data

Anatomy of a Spreadsheet

Workbook1.xls

Sheet1: Sheet1:A1, Sheet1:B1, Sheet1:C1, Sheet1:A2, Sheet1:B2, Sheet1:C2, ...

Sheet2: Sheet2:A1, Sheet2:B1, Sheet2:C1, Sheet2:A2, Sheet2:B2, Sheet2:C2, ...

Page 12: Linked Census Data

Anatomy of a Spreadsheet

Workbook1.xls

Sheet1 (cell values shown): 12, agriculture, workers, 6, industry, ...

Sheet2 (cell values shown): 34, A, diamond cutters, 67, B, ...

Page 13: Linked Census Data

Anatomy of a Spreadsheet

[Figure: the same workbook and cell values as on the previous slide.]

NB: all URIs scoped to sheet!
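One way to read "scoped to sheet" is that the workbook and sheet names become part of every URI, so the same label in two sheets yields two different identifiers. The sketch below assumes that reading; the example.org base and the `scoped_uri` helper are hypothetical and do not reproduce TabLinker's actual URI scheme.

```python
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace


def scoped_uri(workbook, sheet, name):
    """Build a URI for a cell or a cell value, scoped to its workbook and sheet."""
    parts = (workbook, sheet, name)
    return BASE + "/".join(quote(part, safe="") for part in parts)


cell = scoped_uri("Workbook1.xls", "Sheet1", "A1")
in_sheet1 = scoped_uri("Workbook1.xls", "Sheet1", "agriculture")
in_sheet2 = scoped_uri("Workbook1.xls", "Sheet2", "agriculture")
```

Keeping the two "agriculture" URIs apart means nothing is merged by accident; deciding that they denote the same thing becomes an explicit, later step.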


Page 14: Linked Census Data

Data Cube

How to best represent numeric data, in a flexible way?

SDMX (Eurostat, World Bank, CBS, etc.)

Every data item is an observation

Every observation has a value

Every observation has one or more dimensions
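The three statements above can be sketched as a small data structure. This is an illustration only: the `Observation` class, the dimension names and the numbers are assumptions, not the Data Cube vocabulary itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    value: int          # the number in the data cell
    dimensions: tuple   # ((dimension name, dimension value), ...)

    def get(self, name):
        return dict(self.dimensions).get(name)


def total(observations, **wanted):
    """Sum the values of all observations that match the given dimension values."""
    return sum(
        obs.value for obs in observations
        if all(obs.get(name) == value for name, value in wanted.items())
    )


# Made-up observations; "M"/"V" stand for male/female.
observations = [
    Observation(1, (("occupation", "pannenbakkers"), ("sex", "M"), ("age", 12))),
    Observation(3, (("occupation", "pannenbakkers"), ("sex", "M"), ("age", 13))),
    Observation(2, (("occupation", "pannenbakkers"), ("sex", "V"), ("age", 12))),
]
```

Since every observation carries its own dimensions, a call such as `total(observations, sex="M")` does not depend on where the number sat in the original sheet.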


Page 16: Linked Census Data

Data Cube

[Figure: one example observation, the value 1, surrounded by its dimensions. Dimension labels: positie (position), beroep (occupation), letter der beroepsklasse (letter of the occupational class), nummer der beroepsklasse (number of the occupational class), geslacht (sex), huwelijkse staat (marital status), geboortejaar (year of birth), leeftijd (age). Dimension values shown: D, pannenbakkers, E, I, O, M, 12, 1878.]

Page 17: Linked Census Data

Data Cube

[Figure: the same example observation shown twice; in the second copy several items are marked with "?".]

Page 18: Linked Census Data

Anatomy of a Spreadsheet

[Figure: a sheet with its regions labelled Headers, Properties, Data and RowHeaders.]


Page 20: Linked Census Data

Anatomy of a Spreadsheet

[Figure: the same labelled sheet as on Page 18.]

http://github.com/Data2Semantics/TabLinker

Page 21: Linked Census Data

[Figure: RDF graph for one data cell, Sheet1:D15. Node labels: _:x, Sheet1:I/E/Fabricage_van_dakpannen__pannenbakkers, :I/E, :I, :O, :M, :D, :14--15_1875--1874, "1"^^xsd:int. Edge labels: d2s:dimension, d2s:populationSize, skos:broader. Other labels: :Nummer_der_beroepsklasse, :Letter__Onderdeel_beroepsklasse_, :Positie_in_het_beroep__aangeduid_met_A__B__C_of_D, :BENAMING_van_de_onderdeelen_der_onderscheidene_beroepsklassen__met_de_daartoe_behoorende_beroepen.]

Page 22: Linked Census Data

[Figure: the graph of the previous slide, extended with the spreadsheet cells it was derived from. Cell labels: Sheet1:L15, Sheet1:L3, Sheet1:L4, Sheet1:L5, Sheet1:L6, Sheet1:B8, Sheet1:C14, Sheet1:D15, Sheet1:E15, Sheet1:F15. Type labels: d2s:DataCell, d2s:Header, d2s:HierarchicalRowHeader, d2s:RowHeader, d2s:Metadata. Edge labels: rdf:type, d2s:isObservation, d2s:isDimension, d2s:dimension, d2s:populationSize, skos:broader. Additional node labels: :10, :5, :Regelnummer.]

Page 23: Linked Census Data

What TabLinker can’t do

Annotations: “footnote”-style, on separate sheet

Interpret functions, e.g. automatic sums

Integrate/harmonise across sheets/files

Additional useful functionality:

“checksum” functionality

Export to database tables

Page 24: Linked Census Data

Normalising & Correcting

<_:x, d2s:dimension, :14--15_1875--1874>

<_:x, d2s:populationSize, "1"^^xsd:int>

Page 25: Linked Census Data

Normalising & Correcting

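[Figure: the observation before and after normalising and correcting.

Before: <_:x, d2s:dimension, :14--15_1875--1874> and <_:x, d2s:populationSize, "1"^^xsd:int>.

After: a second d2s:populationSize value, "11"^^xsd:int, appears next to the original "1"^^xsd:int, and the figure adds d2s:ageGroup :14-15, d2s:birthYears :1875--1874, d2s:censusYear "1889"^^xsd:int and d2s:gemeente (municipality) :Assendelft.]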
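A rough sketch of the two operations on this slide: splitting the combined header :14--15_1875--1874 into an age group and birth years, and adding a corrected population size while keeping the original. The dictionary layout and the function names are assumptions for illustration; the talk itself keeps these statements as RDF.

```python
import re

# Matches composite headers such as "14--15_1875--1874".
COMPOSITE = re.compile(r"^(\d+)--(\d+)_(\d{4})--(\d{4})$")


def split_header(label):
    """Split an 'age range + birth years' header into two separate values."""
    found = COMPOSITE.match(label)
    if found is None:
        raise ValueError(f"unexpected header: {label!r}")
    age_from, age_to, year_a, year_b = found.groups()
    return {"ageGroup": f"{age_from}-{age_to}",
            "birthYears": f"{year_a}--{year_b}"}


def correct(observation, new_value):
    """Add a corrected value without discarding the original one."""
    updated = dict(observation)
    updated["populationSize"] = observation["populationSize"] + [new_value]
    return updated


original = {"dimension": "14--15_1875--1874", "populationSize": [1]}
normalised = {**correct(original, 11), **split_header(original["dimension"])}
```

The original record is left untouched, in line with the goal of preserving every stage of cleansing.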


Page 26: Linked Census Data

Documenting

http://www.w3.org/TR/prov-o/

[Figure: provenance of a correction. Node labels: <http://example.com/workbook1/sheet1>, <http://example.com/workbook1/sheet1/corrected>, :curation20120126, provo:Activity, :RinkeHoekstra, _:a, _:b, "20120126T08:30:00", "20120126T09:00:00". Edge labels: provo:wasGeneratedBy, rdf:type, provo:hadAgent, provo:startedAt, provo:endedAt, time:inXSDDateTime. The corrected observation of the previous slide is shown alongside.]
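The provenance trail can be sketched as a handful of triples. How the nodes connect is an assumption reconstructed from the labels on the slide, including which blank node carries which timestamp. The property names are copied from the slide and may differ from those in later versions of PROV-O.

```python
def curation_record(corrected, activity, agent, started, ended):
    """Describe one correction step as (subject, predicate, object) triples."""
    return [
        (corrected, "provo:wasGeneratedBy", activity),
        (activity, "rdf:type", "provo:Activity"),
        (activity, "provo:hadAgent", agent),
        (activity, "provo:startedAt", "_:a"),
        ("_:a", "time:inXSDDateTime", started),
        (activity, "provo:endedAt", "_:b"),
        ("_:b", "time:inXSDDateTime", ended),
    ]


record = curation_record(
    corrected="http://example.com/workbook1/sheet1/corrected",
    activity=":curation20120126",
    agent=":RinkeHoekstra",
    started="20120126T08:30:00",
    ended="20120126T09:00:00",
)
```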


Page 27: Linked Census Data

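Harmonising

[Figure: an occupation hierarchy linked by skos:broader. Class labels: I, A, D, E. Occupation labels: Fabricage van dakpannen (pannenbakkers), i.e. manufacture of roof tiles (roof-tile makers); Fabricage van steen (molensteen, steenbakkers, tegelbakkers), i.e. manufacture of stone (millstone, brick makers, tile makers); Fabricage van kalk, i.e. manufacture of lime; Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.), i.e. manufacture of earthenware (incl. porcelain, terracotta, stove makers, potters, etc.).]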

Page 28: Linked Census Data

Harmonising

[Figure: the same occupation hierarchy, with its concepts mapped to HISCO codes. HISCO labels shown: HISCO:23810, HISCO:23811, HISCO:25281, HISCO:26340, HISCO:26345. Mapping edge labels: skos:exactMatch, skos:closeMatch, skos:broadMatch.]

Page 29: Linked Census Data

Harmonising

[Figure: the same occupation hierarchy shown twice, once with every label prefixed by Sheet1: (for example Sheet1:I, Sheet1:E, Sheet1:Fabricage van dakpannen (pannenbakkers)) and once without the prefix.]

Page 30: Linked Census Data

[Figure: two versions of the occupation hierarchy side by side, with the year labels 1889 and 1899. One version reads Fabricage van steen (molensteen, steenbakkers, tegelbakkers) and Fabricage van aardewerk (incl. porcelein, terracotta, kachelbakkers, pottenbakkers, enz.); the other reads Fabricage van steen (steenbakkers, tegelbakkers) and Fabricage van aardewerk (incl. porcelein, kachelbakkers, pottenbakkers, enz.). Mapping edge labels between the two: skos:exactMatch, skos:narrowMatch, skos:closeMatch.]

Page 31: Linked Census Data

[Figure: the 1889 and 1899 hierarchies and their SKOS mappings, as on the previous slide.]

Is SKOS sufficient?

NB: These are not strings, but globally unique URIs, scoped within their spreadsheet (graph!) of origin.
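To make the question concrete, here is a small sketch of how SKOS mappings between two census years could be used. The `census1889:` and `census1899:` names are hypothetical, and which relation the slide attaches to which pair of concepts is an assumption. The idea is that exact and close matches are compared directly, while broader or narrower matches are set aside for a closer look.

```python
COMPARABLE = {"skos:exactMatch", "skos:closeMatch"}


def targets(concept, mappings):
    """Split a concept's mapping targets into comparable ones and ones to review."""
    direct, review = [], []
    for source, relation, target in mappings:
        if source == concept:
            (direct if relation in COMPARABLE else review).append(target)
    return direct, review


# Hypothetical URIs: the same label gets a different URI per sheet of origin.
MAPPINGS = [
    ("census1889:dakpannen", "skos:exactMatch", "census1899:dakpannen"),
    ("census1889:steen", "skos:narrowMatch", "census1899:steen"),
]

roof_tiles = targets("census1889:dakpannen", MAPPINGS)
stone = targets("census1889:steen", MAPPINGS)
```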


Page 34: Linked Census Data

Vocabularies, Tools

Vocabularies: Data Cube, SKOS, W3C Time, PROV-O

Excel + TabLinker: Semi-automatic conversion of Excel sheets to RDF

ProvTracer: Create PROV-O provenance trail for shell/python scripts

Visualization Prototype: SGVizler (SPARQL + Google Graph API)

Page 35: Linked Census Data

Discussion

Advantages of Linked Data approach

Straightforward transformation from spreadsheets

Seamless integration of original, corrected and harmonised data

Ingestion of external (linked) data

Powerful documentation (provenance)

Everything is transparently query-able (SPARQL)

.... on the Web


Page 36: Linked Census Data

Discussion

Disadvantages of Linked Data approach (subject to research)

Size? (300k * 519 sheets = 156M triples)

Only rudimentary support for arithmetical operations in queries

No dynamic/conditional ‘view’-like graphs
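The size estimate is simple arithmetic, spelled out below. Reading "300k" as triples per sheet is an assumption, and the second figure uses the 1200 spreadsheets mentioned as hearsay earlier in the talk.

```python
TRIPLES_PER_SHEET = 300_000  # "300k" from the slide, read as triples per sheet


def estimated_triples(sheets, per_sheet=TRIPLES_PER_SHEET):
    """Rough total number of triples for a given number of sheets."""
    return sheets * per_sheet


for_519 = estimated_triples(519)    # 155,700,000, roughly 156M
for_1200 = estimated_triples(1200)  # 360,000,000
```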


Page 37: Linked Census Data

SPARQL vs. SQL?

Middle ground?

Expose database through D2RQ


Page 38: Linked Census Data

Fin
