Ingestion workflows. Presentation at the Europeana Aggregator Forum 2015
Europeana’s and aggregators’ ingestion workflows
Cécile Devarenne, Operations Officer
Aggregators Forum Workshop, Den Haag, 18th May 2015
Content
• Europeana Aggregation workflow
• Europeana Ingestion & data processing workflow
• Overview of tools
• Data flows and tasks
• Workshop talk 1: common and specific
• Workshop talk 2: the good, the bad
• Workshop talk 3: gathering ideas
Submission of data and publication cycles
• Operations officers work on a monthly cycle
• Each month, data needs to be submitted by the 21st to be included in the coming cycle and published by the 15th to 20th of the following month
• A dataset takes on average 40 minutes to process
• Around 200 datasets are processed by the Operations officers for each cycle of publication
• Datasets go through a full flow of operations before they are production ready
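The cycle rules above can be written out as a small calculation. This is only a sketch of the arithmetic on the slide: the function names are hypothetical, and the assumption that a submission after the 21st simply waits for the next cycle is ours, not stated in the presentation.

```python
from datetime import date
from typing import Tuple

SUBMISSION_DAY = 21  # monthly cut-off stated on the slide


def publication_month(submitted: date) -> Tuple[int, int]:
    """Return the (year, month) in which a submission is expected to go live."""
    # Submitted by the 21st: published the following month.
    # Submitted later: assumed (not stated on the slide) to slip one more cycle.
    offset = 1 if submitted.day <= SUBMISSION_DAY else 2
    month_index = submitted.year * 12 + (submitted.month - 1) + offset
    return month_index // 12, month_index % 12 + 1


def cycle_processing_hours(datasets: int = 200, minutes_per_dataset: int = 40) -> float:
    """Total processing time per cycle, using the averages quoted on the slide."""
    return datasets * minutes_per_dataset / 60
```

With the figures quoted, 200 datasets at 40 minutes each come to roughly 133 hours of processing per cycle.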
Steps to get data ingested
• Data flows:
  • IMPORT
  • MAP/EDIT/TRANSFORM
  • VALIDATE
  • ENRICH
  • PUBLISH
IMPORT
• Manage/structure data for imported collections (CRM & UIM):
  • Dataset entries created and monitored
• Harvest data records grouped in datasets (Repox):
  • OAI-PMH and HTTP protocols; XML data
  • No incremental harvesting
  • Storage of data in Repox’s database
• Ongoing developments:
  • New version of Repox, which will be shared by Europeana and The European Library
  • Repox’s harvested data stored in the Europeana Cloud
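To illustrate what a full (non-incremental) OAI-PMH harvest involves, here is a minimal sketch of the request and response handling. It is not Repox code: the function names and the `http://example.org/oai` endpoint are hypothetical, and only the standard `ListRecords` verb, its `metadataPrefix`, `set` and `resumptionToken` arguments, and the OAI-PMH 2.0 namespace are taken from the protocol itself. No `from`/`until` date arguments are sent, which matches the "no incremental harvesting" point above.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for a full harvest of one set."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from one response page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token
```

A harvester would call `list_records_url` repeatedly, passing back each returned token, until `parse_list_records` yields no token.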
MAP/EDIT/TRANSFORM (Mint)
• Map and transform from source (ESE, EDM External) to target (EDM Internal) (Mint):
  • User interface, drag and drop functionality
  • All mappings are stored
  • The last versions of the incoming and transformed data are stored in Mint’s database
• Clean data (Mint):
  • User interface, functions are applied to the data
  • Statistics and preview functionality help with the quality checks
MAP/EDIT/TRANSFORM (UIM)
• Itemize and create/manage Europeana identifiers for permalinks to your records in Europeana (UIM plugin)
  • One record per ProvidedCHO
  • From http://data.theeuropeanlibrary.org/BibliographicResource/3000118920655 to http://data.europeana.eu/item/9200338/BibliographicResource_3000118920655
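The identifier example above suggests a simple pattern: the dataset number followed by the source path with slashes turned into underscores. The sketch below reproduces that one example only. The function name is hypothetical and the rule is inferred from this single pair of URIs; the actual UIM plugin may normalise other characters or apply further rules.

```python
from urllib.parse import urlparse


def europeana_item_uri(source_uri: str, collection_id: str) -> str:
    """Derive a Europeana item URI from a provider URI (rule inferred from one example)."""
    # Take the path of the provider URI and flatten it into a single local identifier.
    path = urlparse(source_uri).path.strip("/")
    local_id = path.replace("/", "_")
    return f"http://data.europeana.eu/item/{collection_id}/{local_id}"
```

Called with the provider URI from the slide and the dataset number `9200338`, this returns the Europeana URI shown above.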
VALIDATE
• Validate data automatically against the EDM Internal schema (Mint and UIM):
  • XSD schema and Schematron rules (mandatory elements, types of values)
  • Only valid records are saved in UIM’s database: invalid records are discarded
  • Checks on unique identifiers within a dataset: duplicate records are also discarded
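The discard behaviour described above (invalid records dropped, then duplicates dropped) can be sketched as a single filtering pass. This is an illustration only: records are modelled as plain dictionaries, and the `MANDATORY_FIELDS` check is a hypothetical stand-in for the real XSD and Schematron validation, which this sketch does not perform.

```python
# Hypothetical stand-in for the mandatory-element rules of the EDM Internal schema.
MANDATORY_FIELDS = ("identifier", "title", "rights")


def filter_records(records):
    """Split records into (kept, discarded), where discarded holds (record, reason) pairs."""
    seen = set()
    kept, discarded = [], []
    for record in records:
        missing = [field for field in MANDATORY_FIELDS if not record.get(field)]
        if missing:
            # Invalid records are not saved.
            discarded.append((record, "invalid: missing " + ", ".join(missing)))
        elif record["identifier"] in seen:
            # A later record reusing an identifier within the dataset is dropped.
            discarded.append((record, "duplicate identifier"))
        else:
            seen.add(record["identifier"])
            kept.append(record)
    return kept, discarded
```

Returning the reasons alongside the discarded records makes it possible to report back to the data provider why a record did not reach publication.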
ENRICH (semantic enrichment)
• Data enrichment against external datasets
• Dereference (UIM plugin):
  • generate additional contextual data in EDM from links to ontologies exposed as linked data
  • maintain mappings between the vocabularies to be dereferenced and EDM
• Enrich (UIM plugin):
  • generate additional contextual data (links to external resources) from analysis of the provided data
  • maintain the corpus of resources against which the EDM data is enriched; maintain enrichment rules
• Media links:
  • Cache thumbnails
  • Extract technical metadata from links to digital objects
• Extract hierarchies:
  • Hierarchical data (several objects within a dataset related to one another to reflect a hierarchy, e.g. a book and its chapters) is processed through our hierarchical objects plugin
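As an illustration of the hierarchy step, the sketch below groups records of one dataset into parents and children, in the spirit of the book-and-chapters example. It is not the hierarchical objects plugin: the function name and the `id` / `is_part_of` field names are hypothetical, chosen only to show the idea of resolving part-of links within a dataset.

```python
from collections import defaultdict


def build_hierarchy(records):
    """Return (root ids, {parent id: [child ids]}) for records within one dataset."""
    known_ids = {record["id"] for record in records}
    children = defaultdict(list)
    roots = []
    for record in records:
        parent = record.get("is_part_of")
        if parent in known_ids:
            children[parent].append(record["id"])
        else:
            # No parent, or a parent outside this dataset: treat as a top-level object.
            roots.append(record["id"])
    return roots, dict(children)
```

For a book with two chapters, the book ends up as the single root and both chapters are listed as its children.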
PUBLISH
• Deploy content monthly
• New and updated data retrievable on europeana.eu and via the API
Workshop talk 1: common and specific
• Do you agree with these broad categories and believe they also apply to your data workflow? Is that the case for all?
• Could you list tasks under each category illustrating your work? Example: creation of persistent identifiers, link checking
• Are these tasks all common to everyone at your table?
• Which specific tasks from your list do you not see represented at your table?
• 20 minutes
Workshop talk 2: the good, the bad
• What are the principal issues you encounter within your workflows?
• Are these issues shared with other partners at your table?
• What is the impact of the connections between workflows (data provider to aggregator to Europeana) on your data processing tasks?
• 20 minutes
Workshop talk 3: gathering ideas
• What kind of concrete changes from Europeana could improve your data processing work?
• What steps in your current workflow could you use help with? (validation, preview, …)
• Are there any tools you use already that you could recommend to everyone?
• All feedback and questions are welcome!
• 20 minutes
Guidance and help
• Europeana Professional: http://pro.europeana.eu/share-your-data/
• Content inbox, for all ingestion & metadata related matters: [email protected]
Thank you!
Cécile Devarenne, Chiara Latronico, Marie-Claire Dangerfield, Pablo Uceda Gomez, Jeroen Cichy