EUDOR: Digital Archive at the Publications Office of the European Union

PREMIS ongoing implementation

Lina Bountouri

Collections and formats

• Legislative collections (EUR-Lex)

• 1M works, 120M+ files, 15 TB

• Content formats

• PDF/A + signed (OJ) XML

• XML (Formex), XHTML, JPEG

• Other XML

• General publications (EU-Bookshop)

• 700K files, 9 TB

• Content formats

• PDF/X, PDF/A

• JPEG for thumbnails

• Other: ePub, XML…

• Representation information

• Ontologies & other KOS, XML schemas, etc.

Automated Production Workflow of the EU Official Publications

[Workflow diagram, labels only: IMMC XML + digital objects; RDF/OWL + digital objects in METS; METS SIPs]

Architecture: production system and archive

• Production system: Cellar

• Public

• Ingestion server not exposed

• Replicated to N read-only servers exposed to the internet

• Two parts

• Repository (Fedora)

• Triplestore (Virtuoso and Oracle)

• Ontology based on FRBR

• Normalized KOS

• Archive: Eudor

• Not public

• RODA/OAIS

• SIPs, DIPs: METS

• AIPs: E-ARK

• Fed mainly by Cellar

Representation information: descriptive metadata

• Available at <http://publications.europa.eu/mdr/>

• CDM ontology based on FRBR: RDF/OWL

• Work / Expression (language) / Manifestation (file format) / Item (file/s)

• Complexity: compound works, +24 languages (sometimes multiple), volume (massive updates), dependencies (works linked +40K)…

• KOS: SKOS/XML

• Over 70 tables, incl. Eurovoc, updated often, backwards compatibility, other clients…

• Access

• RESTful interface, dereferencing with same URI

• SPARQL endpoint: <http://publications.europa.eu/webapi/rdf/sparql/>
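
As an illustration of the access route above, here is a minimal sketch of how a client might query the SPARQL endpoint over HTTP using only the Python standard library. The query is deliberately generic so that it assumes no CDM class or property names; the helper names `build_request` and `run_query` are hypothetical, and the JSON result format is requested on the assumption that the endpoint supports the standard SPARQL results media type.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint URL as given on the slide.
ENDPOINT = "http://publications.europa.eu/webapi/rdf/sparql/"

# Generic query: it does not assume any CDM class or property names.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"


def build_request(endpoint: str, query: str) -> Request:
    """Build a SPARQL-protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def run_query(endpoint: str, query: str) -> list:
    """Send the query and return the list of result bindings."""
    with urlopen(build_request(endpoint, query), timeout=30) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]
```

Calling `run_query(ENDPOINT, QUERY)` needs network access; each returned binding is a dictionary keyed by the variable names `s`, `p` and `o`.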

Representation information: technical metadata

• Ontology and files not public (yet)

• Cannot make statistical SPARQL queries on it

• Contains

• File format

• Fixity: checksum and algorithm

• Size

• …
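
To make the listed fields concrete, the following is a minimal sketch of how size and fixity (checksum plus algorithm) could be collected for one file. The function name and dictionary keys are hypothetical, and the file extension is only a crude stand-in for real format identification, which a dedicated tool would perform.

```python
import hashlib
import os


def technical_metadata(path: str, algorithm: str = "sha256") -> dict:
    """Collect size and fixity information for a single file."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so that large files are not loaded into memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {
        # Assumption: the extension is a placeholder, not a format identifier.
        "file_extension": os.path.splitext(path)[1].lower(),
        "fixity_algorithm": algorithm,
        "fixity_checksum": digest.hexdigest(),
        "size_bytes": os.path.getsize(path),
    }
```

Recording the algorithm next to the checksum lets a later fixity check recompute the digest with the same function and compare the two values.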

Provenance/contextual metadata in the EUPO

• What drove us in this direction?

• Many events are taking place related to each resource

• Internal decision of our management

• New software for our digital archival repository (RODA/KEEP Solutions)

• OAIS/ISO 16363 Consultants for Auditing

• “Are there any metadata for the custody/context/provenance of your metadata/digital objects?”

Why have we chosen PREMIS?

• PROV-O

• PREMIS

• It was already implemented in the new version of Eudor (v3) - RODA

• It suited our provenance/contextual documentation needs

• It is widely implemented by libraries and archives

• It has a strong community of users

• It is based on RDF

Which provenance/contextual events will we encode?

• We had to limit the number of events to be encoded

• Decision was influenced by the number of triples

• Based partially on the LOC events list; new types of events are needed

• For all our WEMI objects

• Modelling almost completed/implementation in our systems

Which provenance/contextual events will we encode?

• Events in newCeres

• Transmission (metadata by the Data Providers)

• Reception (metadata by newCeres)

• Events in Cellar

• Validation (against METS, KOS, CDM Ontology)

• Operations in Cellar, such as creation, deletion, update, embargo/disembargo

• METS-export (could be useful for traceability and to avoid replicating provenance data in different environments)

• Provenance/contextual metadata from newCeres and Cellar will be stored as RDF triples in Cellar.
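
As a rough illustration of what storing events as RDF triples could involve, and of why the triple count matters, here is a minimal sketch that serialises one event as N-Triples. The `http://example.org/provenance#` namespace, its property names and the helper functions are hypothetical placeholders; they are not the PREMIS ontology terms and not the model actually used in Cellar.

```python
from datetime import datetime, timezone

# Hypothetical namespace: placeholder terms, NOT the PREMIS ontology vocabulary.
EX = "http://example.org/provenance#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def event_triples(event_uri, event_type, resource_uri, when):
    """Return the N-Triples lines describing one event on one resource."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return [
        f'<{event_uri}> <{RDF_TYPE}> <{EX}Event> .',
        f'<{event_uri}> <{EX}eventType> "{event_type}" .',
        f'<{event_uri}> <{EX}onResource> <{resource_uri}> .',
        f'<{event_uri}> <{EX}eventDateTime> "{stamp}"^^<{XSD_DATETIME}> .',
    ]


def estimated_triples(resources, events_per_resource, triples_per_event=4):
    """Rough store growth: every event on every resource adds a few triples."""
    return resources * events_per_resource * triples_per_event
```

Even with only four triples per event, the total grows with the product of resources and events per resource, which is the kind of pressure that limits how many event types can be encoded.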

Which provenance/contextual events will we encode?

• Events in Eudor v3

1. Ingest start: The ingest process has started.

2. Unpacking: Extracted objects from package in file/folder format.

3. Wellformedness check: Checked that the received SIP is well formed, complete and that no unexpected files were included.

4. Virus check: Scanned package for malicious programs using ClamAV.

5. Wellformedness check: Checked whether the descriptive metadata is included in the SIP and if this metadata is valid according to the established policy.

6. Message digest calculation: Created base PREMIS objects with file original name and file fixity information (SHA-256).

7. Format identification: Identified the object's file formats and versions using Siegfried.

8. Authorization check: Producer permissions have been checked to ensure that the producer has sufficient authorization to store the AIP under the desired node of the classification scheme.

9. Accession: Added package to the inventory. After this point, the responsibility for the digital content’s preservation is passed on to the repository.

10. Ingest end: The ingest process has ended.
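
The ingest sequence above can be sketched as an ordered list of steps, each leaving an event record with an outcome and a timestamp. This is an illustrative sketch only: the `EventRecord` fields, the `run_ingest` function, the stub checks and the rule that a failed step aborts the ingest are all assumptions, not a description of how RODA implements it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[Dict], bool]]


@dataclass
class EventRecord:
    event_type: str
    outcome: str  # "success" or "failure"
    timestamp: str


def _record(event_type: str, ok: bool) -> EventRecord:
    now = datetime.now(timezone.utc).isoformat()
    return EventRecord(event_type, "success" if ok else "failure", now)


def run_ingest(package: Dict, steps: List[Step]) -> List[EventRecord]:
    """Run the steps in order and log one event per step."""
    log = [_record("Ingest start", True)]
    for name, check in steps:
        ok = check(package)
        log.append(_record(name, ok))
        if not ok:
            return log  # Assumption: a failed step aborts the ingest.
    log.append(_record("Ingest end", True))
    return log


# Placeholder checks: each inspects a flag on a dict standing in for a SIP.
EXAMPLE_STEPS: List[Step] = [
    ("Unpacking", lambda p: True),
    ("Wellformedness check", lambda p: p.get("well_formed", False)),
    ("Virus check", lambda p: not p.get("infected", False)),
    ("Accession", lambda p: True),
]
```

For example, `run_ingest({"well_formed": True}, EXAMPLE_STEPS)` yields six records ending with "Ingest end", while a package flagged as infected stops at the "Virus check" record with a failure outcome.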

Encoding of preservation actions

Open issues/Key points

• We have not yet implemented provenance/contextual metadata in Cellar

• We are currently writing the specs

• The number of RDF triples should not get too high: this limits the events we will encode in our production workflow

• The current list of events in the LOC does not cover most of our cases

Any questions?