Ingestion workflows. Presentation at the Europeana Aggregator Forum 2015

Europeana’s and aggregators’ ingestion workflows
Cécile Devarenne, Operations Officer
Aggregators Forum Workshop, Den Haag, 18th May 2015

Content

• Europeana Aggregation workflow
• Europeana Ingestion & data processing workflow
  • Overview of tools
  • Data flows and tasks
• Workshop talk 1: common and specific
• Workshop talk 2: the good, the bad
• Workshop talk 3: gathering ideas

Europeana aggregation workflow

Europeana ingestion workflow

Submission of data and publication cycles

• Operations officers work on a monthly cycle
• Each month, data needs to be submitted by the 21st to be included in the coming cycle and published between the 15th and 20th of the following month
• A dataset takes on average 40 minutes to process
• Around 200 datasets are processed by the Operations officers in each publication cycle
• Datasets go through a full flow of operations before they are production ready
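The cycle rules above can be written down as a small calculation. This is only a sketch: it assumes that data submitted after the 21st waits for the next cycle, and the function names are hypothetical.

```python
from datetime import date

SUBMISSION_DAY = 21          # monthly deadline stated on the slide
PUBLICATION_DAYS = (15, 20)  # publication window in the following month
AVG_MINUTES_PER_DATASET = 40
DATASETS_PER_CYCLE = 200


def next_month(year: int, month: int) -> tuple[int, int]:
    """Return (year, month) of the month after the given one."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def publication_window(submitted: date) -> tuple[date, date]:
    """Earliest and latest expected publication date for a submission.

    Assumption: data submitted after the 21st waits for the next cycle.
    """
    year, month = submitted.year, submitted.month
    if submitted.day > SUBMISSION_DAY:
        year, month = next_month(year, month)
    pub_year, pub_month = next_month(year, month)
    first, last = PUBLICATION_DAYS
    return date(pub_year, pub_month, first), date(pub_year, pub_month, last)


def processing_hours_per_cycle() -> float:
    """Total processing time per cycle implied by the two averages."""
    return DATASETS_PER_CYCLE * AVG_MINUTES_PER_DATASET / 60


if __name__ == "__main__":
    print(publication_window(date(2015, 5, 18)))
    print(round(processing_hours_per_cycle(), 1))
```

At the stated averages, 200 datasets at 40 minutes each come to roughly 133 hours of processing per cycle.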

What happens to your data? Europeana ingestion tools and software

Europeana ingestion workflow

Unified Ingestion Manager

What happens to your data? Ingestion processes and data flows

Steps to get data ingested

• Data flows:
  • IMPORT
  • MAP/EDIT/TRANSFORM
  • VALIDATE
  • ENRICH
  • PUBLISH

IMPORT

• Manage/structure data for imported collections (CRM & UIM):
  • Dataset entries created and monitored
• Harvest data records grouped in datasets (Repox):
  • OAI-PMH and HTTP protocols; XML data
  • No incremental harvesting
  • Storage of data in Repox’s database
• Ongoing developments:
  • New version of Repox, which will be shared by Europeana and The European Library
  • Repox’s harvested data stored in the Europeana Cloud
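A full (non-incremental) OAI-PMH harvest of one set can be sketched as below. This is a minimal illustration of the protocol's ListRecords loop, not Repox's implementation; the endpoint URL, set name and function names are assumptions. The `from` and `until` arguments are left out on purpose, since the slide states that harvesting is not incremental.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, set_spec=None, token=None):
    """Build a ListRecords request URL.

    Per OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix and set must be left out.
    """
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (records, resumption_token) for one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = root.findall(f"{OAI_NS}ListRecords/{OAI_NS}record")
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    has_token = token_el is not None and token_el.text
    return records, (token_el.text.strip() if has_token else None)


def harvest(base_url, metadata_prefix, set_spec=None):
    """Yield every record of a set; a full harvest, not an incremental one."""
    token = None
    while True:
        url = list_records_url(base_url, metadata_prefix, set_spec, token)
        with urllib.request.urlopen(url) as response:
            records, token = parse_page(response.read())
        yield from records
        if not token:  # an empty or missing token marks the last page
            break
```

A call such as `harvest("http://example.org/oai", "edm", "my_set")` (hypothetical endpoint and set) would page through the whole set on every run.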

IMPORT (Repox)

MAP/EDIT/TRANSFORM (Mint)

• Map and transform from source (ESE, EDM External) to target (EDM Internal) (Mint):
  • User interface, drag and drop functionality
  • All mappings are stored
  • The last versions of the incoming and transformed data are stored in Mint’s database
• Clean data (Mint):
  • User interface, functions are applied to the data
  • Statistics and preview functionality help with quality checks

MAP/EDIT/TRANSFORM (UIM)

• Itemize and create/manage Europeana identifiers for permalinks to your records in Europeana (UIM plugin)
  • One record per ProvidedCHO
  • From http://data.theeuropeanlibrary.org/BibliographicResource/3000118920655 to http://data.europeana.eu/item/9200338/BibliographicResource_3000118920655
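The identifier example above can be reproduced with a few lines. The rule used here is an assumption inferred from that single example (drop scheme and host, replace non-alphanumeric characters with underscores, prefix with what appears to be the dataset number); the actual UIM plugin may apply further rules.

```python
import re
from urllib.parse import urlparse

EUROPEANA_ITEM_BASE = "http://data.europeana.eu/item"


def europeana_identifier(source_uri: str, dataset_id: str) -> str:
    """Derive a Europeana item URI from a provider URI.

    Assumed rule, inferred from the single example on the slide: drop the
    scheme and host, then replace every character of the remaining path
    that is not a letter or digit with an underscore.
    """
    local_path = urlparse(source_uri).path.strip("/")
    local_id = re.sub(r"[^A-Za-z0-9]", "_", local_path)
    return f"{EUROPEANA_ITEM_BASE}/{dataset_id}/{local_id}"


if __name__ == "__main__":
    print(europeana_identifier(
        "http://data.theeuropeanlibrary.org/BibliographicResource/3000118920655",
        "9200338",
    ))
```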

VALIDATE

• Validate data automatically against the EDM Internal schema (Mint and UIM):

• XSD schema and schematron rules (mandatory elements, types of values)

• Only valid records are saved in UIM’s database: invalid records are discarded

• Checks on unique identifiers within a dataset: duplicate records are also discarded
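The two discard rules above (invalid records, duplicate identifiers) can be illustrated as follows. This sketch stands in for the real XSD and Schematron validation: records are plain dictionaries, and the list of mandatory fields is an illustrative subset chosen for the example, not the EDM Internal schema.

```python
# Illustrative subset only; the real rules live in the XSD and Schematron files.
MANDATORY = ("id", "edm:dataProvider", "edm:provider", "edm:rights")


def validate_dataset(records):
    """Split records into (valid, discarded) following the two slide rules.

    Each record is a dict of metadata fields. A record is discarded when a
    mandatory field is missing or empty, or when its identifier was already
    seen earlier in the same dataset.
    """
    seen_ids = set()
    valid, discarded = [], []
    for record in records:
        missing = [name for name in MANDATORY if not record.get(name)]
        if missing:
            discarded.append((record, "missing " + ", ".join(missing)))
        elif record["id"] in seen_ids:
            discarded.append((record, "duplicate identifier"))
        else:
            seen_ids.add(record["id"])
            valid.append(record)
    return valid, discarded
```

Keeping the discard reason next to each rejected record makes it possible to report back to the data provider why a record did not reach the database.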

VALIDATE (Mint)

VALIDATE (UIM)

ENRICH (semantic enrichment)

• Data enrichment against external datasets
• Dereference (UIM plugin):
  • generate additional contextual data in EDM from links to ontologies exposed as linked data
  • maintain mappings between the vocabularies to be dereferenced and EDM
• Enrich (UIM plugin):
  • generate additional contextual data (links to external resources) from analysis of the provided data
  • maintain the corpus of resources against which the EDM data is enriched; maintain enrichment rules
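The "Enrich" step above, generating links from analysis of the provided data, can be pictured with a deliberately naive matcher. Everything here is an assumption for illustration: the field names, the vocabulary contents and the case-insensitive label matching are far simpler than any production enrichment rule set.

```python
def enrich(record, vocabulary, fields=("dc:subject", "dcterms:spatial")):
    """Return links to vocabulary resources whose label matches a value.

    `vocabulary` maps a lower-cased label to a resource URI; matching is a
    plain case-insensitive comparison of whole values.
    """
    links = []
    for field in fields:
        for value in record.get(field, []):
            uri = vocabulary.get(value.strip().lower())
            if uri and uri not in links:
                links.append(uri)
    return links


if __name__ == "__main__":
    # Hypothetical corpus of resources to enrich against.
    vocabulary = {"paris": "http://example.org/place/paris"}
    record = {"dcterms:spatial": ["Paris "], "dc:subject": ["maps"]}
    print(enrich(record, vocabulary))
```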

ENRICH

• Media links:
  • Cache thumbnails
  • Extract technical metadata from links to digital objects
• Extract hierarchies:
  • Hierarchical data (several objects within a dataset related to one another to reflect a hierarchy, e.g. a book and its chapters) is processed through our hierarchical objects plugin
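The book-and-chapters case above amounts to grouping records by parent. The sketch below assumes each record carries a hypothetical "isPartOf" key pointing at its parent's identifier; it is not the hierarchical objects plugin itself.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Group record ids by parent, using a hypothetical "isPartOf" key.

    Returns (roots, children): ids with no parent inside the dataset, and a
    mapping from each parent id to its child ids in input order.
    """
    ids = {record["id"] for record in records}
    children = defaultdict(list)
    roots = []
    for record in records:
        parent = record.get("isPartOf")
        if parent in ids:
            children[parent].append(record["id"])
        else:
            # No parent, or a parent outside this dataset: treat as a root.
            roots.append(record["id"])
    return roots, dict(children)


if __name__ == "__main__":
    dataset = [
        {"id": "book"},
        {"id": "chapter1", "isPartOf": "book"},
        {"id": "chapter2", "isPartOf": "book"},
    ]
    print(build_hierarchy(dataset))
```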

PUBLISH

• Deploy content monthly
• New and updated data retrievable on europeana.eu and via the API

Happy ingestion :-)

Enough about Europeana… how about you?

Workshop talk 1: common and specific

IMPORT, TRANSFORM, VALIDATE, ENRICH, PUBLISH

• Do you agree with these broad categories and believe they also apply to your data workflow? Is it the case for all?

• Could you list tasks under each category illustrating your work? Example: creation of persistent identifiers, link checking

• Are these tasks all common to everyone at your table?

• Which tasks on your list are specific to you and not represented at your table?

• 20 minutes

Workshop talk 2: the good, the bad

• What are the principal issues you encounter within your workflows?

• Are these issues shared with other partners at your table?

• How do the connections between workflows (data provider to aggregator to Europeana) affect your data processing tasks?

• 20 minutes

Workshop talk 3: gathering ideas

• What kind of concrete changes from Europeana could improve your data processing work?

• What steps in your current workflow could you use help with? (validation, preview, …)

• Are there any tools you use already that you could recommend to everyone?

• All feedback and questions are welcome!

• 20 minutes

Guidance and help

Europeana Professional: http://pro.europeana.eu/share-your-data/
Content inbox, for all ingestion & metadata related matters: [email protected]

Thank you!

Cécile Devarenne, Chiara Latronico, Marie-Claire Dangerfield, Pablo Uceda Gomez, Jeroen Cichy

[email protected] or [email protected]