Long-term preservation aspects in the eSciDoc project


Page 1: Long-term preservation aspects in the eSciDoc project

Long-term preservation aspects in the eSciDoc project

Natasa Bulatovic

Max-Planck Digital Library

[email protected]

Page 2: Long-term preservation aspects in the eSciDoc project

eSciDoc: Background information

Joint Project between Max Planck Society and FIZ Karlsruhe

Funded by the German Federal Ministry for Education and Research (BMBF) until mid-2009

Additional substantial own efforts by both partners

Both partners committed themselves to ongoing efforts until 2011

Page 3: Long-term preservation aspects in the eSciDoc project

eSciDoc project landscape

Page 4: Long-term preservation aspects in the eSciDoc project

Diversity of MPG supported by a service infrastructure


● People

● Librarians, researchers, developers, general public

● Data

● Publications, research data, supplementary material etc.

● Processes

● Various institutes apply various workflows supporting their data management

Page 5: Long-term preservation aspects in the eSciDoc project

eSciDoc: what does it address and provide?

Organizational aspects

● Business processes: introspection into the “research information and research data life-cycle”

● Modelling of the “research workflows” (usage, scenarios, use cases, process workflows)

● “SOA” is not only technical infrastructure: it is driven by organization-wide processes and requirements

Technical aspects

● Service infrastructure instead of siloed applications

● Solutions to visualize, publish, relate and manage data

● Easy composition of services and data mashups and repurposing

Both organizational and technical activities facilitate the growth of the organization and the dissemination, accessibility and easy reuse of the research results across disciplines

Page 6: Long-term preservation aspects in the eSciDoc project

eSciDoc.PubMan: Management of publications

Page 7: Long-term preservation aspects in the eSciDoc project

eSciDoc.Virr: Management of digitized textual resources

Page 8: Long-term preservation aspects in the eSciDoc project

eSciDoc.FACES: Management of image collections

Page 9: Long-term preservation aspects in the eSciDoc project

Challenges

And facts about the eSciDoc project related to digital preservation and long-term archiving

Page 10: Long-term preservation aspects in the eSciDoc project

Which data are managed?

Support for variety of data (publications, old manuscripts, microscopic images, patents, datasets)

Page 11: Long-term preservation aspects in the eSciDoc project

Abstraction: Data structure

Abstraction in data modelling and in the implemented data services (items, containers)

Specialization through content models gives contextual information on what data it is (publication, scanned book page, lexical resource, collection of “digital things”)

http://colab.mpdl.mpg.de/mediawiki/ESciDoc_Logical_Data_Model

Atomistic model
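The item/container abstraction described on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class names, field names, identifiers and content-model labels below are hypothetical and are not taken from the eSciDoc logical data model.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Component:
    """A file attached to an item (hypothetical fields)."""
    file_name: str
    mime_type: str


@dataclass
class Item:
    """Smallest managed unit: metadata plus zero or more file components."""
    object_id: str
    content_model: str  # says what the data is, e.g. "publication"
    metadata: dict = field(default_factory=dict)
    components: List[Component] = field(default_factory=list)


@dataclass
class Container:
    """Aggregates items and other containers, e.g. a book of scanned pages."""
    object_id: str
    content_model: str
    metadata: dict = field(default_factory=dict)
    members: List[Union[Item, "Container"]] = field(default_factory=list)


def count_items(node: Union[Item, Container]) -> int:
    """Count the items reachable from a node, descending through containers."""
    if isinstance(node, Item):
        return 1
    return sum(count_items(member) for member in node.members)


# Hypothetical example: a collection holding one book with two scanned pages.
page1 = Item("ex:1", "scanned-book-page", {"title": "Page 1"},
             [Component("p1.tif", "image/tiff")])
page2 = Item("ex:2", "scanned-book-page", {"title": "Page 2"})
book = Container("ex:3", "book", {"title": "Example book"}, [page1, page2])
collection = Container("ex:4", "collection", members=[book])
print(count_items(collection))
```

The same two generic classes serve every kind of data; only the content model label and the metadata differ, which is the point of the abstraction.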

Page 12: Long-term preservation aspects in the eSciDoc project

Atomistic model - items


Page 13: Long-term preservation aspects in the eSciDoc project

Atomistic model - containers


Page 14: Long-term preservation aspects in the eSciDoc project

Communities and workflows

Core workflow

Support for variety of workflows for different user-communities

Community specific workflow

Page 15: Long-term preservation aspects in the eSciDoc project

Persistent identification

Each resource and resource component (metadata, files) must be persistently identified

Users may decide what to identify (resource, resource version)

Users may configure when to assign the persistent identifier on a repository level

Keep all persistent identifiers of a resource, including any that existed before the resource was created in the system

Different communities decide differently what to identify and when, and at different stages of completeness of their data
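The requirements above (what to identify, when to assign, and keeping earlier identifiers) could be expressed as a configurable policy. The following is a minimal sketch, not the eSciDoc implementation: all names, the two trigger events and the `mint_example` function are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class PidTarget(Enum):
    RESOURCE = "resource"
    VERSION = "resource version"


class PidTrigger(Enum):
    # Hypothetical workflow events; a real repository may define others.
    ON_CREATE = "on create"
    ON_RELEASE = "on release"


@dataclass
class PidPolicy:
    """Repository-level configuration: what gets a PID, and when."""
    target: PidTarget
    trigger: PidTrigger


@dataclass
class Resource:
    local_id: str
    version: int = 1
    # May already contain identifiers that predate ingest into the system.
    pids: List[str] = field(default_factory=list)


def assign_pid(resource: Resource, policy: PidPolicy, event: PidTrigger,
               mint: Callable[[str], str]) -> Optional[str]:
    """Mint a PID if the event matches the policy; keep all earlier PIDs."""
    if event != policy.trigger:
        return None
    subject = resource.local_id
    if policy.target is PidTarget.VERSION:
        subject = f"{resource.local_id}:{resource.version}"
    pid = mint(subject)
    resource.pids.append(pid)  # append, never overwrite existing identifiers
    return pid


def mint_example(subject: str) -> str:
    """Stand-in for a call to a real PID service (hypothetical prefix)."""
    return f"pid:EXAMPLE/{subject}"


policy = PidPolicy(PidTarget.VERSION, PidTrigger.ON_RELEASE)
resource = Resource("ex:10", version=3, pids=["legacy-pid-123"])
assign_pid(resource, policy, PidTrigger.ON_RELEASE, mint_example)
print(resource.pids)
```

Keeping the policy as data rather than code lets each community choose its own target and trigger without changing the service.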

Page 16: Long-term preservation aspects in the eSciDoc project

Persistent identification system

But which PID system to choose out of many?

● Handle System (CNRI, Corporation for National Research Initiatives)

● Digital Object Identifier, DOI (IDF, International DOI Foundation)

● Archival Resource Key, ARK (California Digital Library)

● Uniform Resource Name, URN (IANA)

● National Bibliography Number, NBN

● Persistent URL, PURL (OCLC, Online Computer Library Center)

● OpenURL

Most important is the organizational commitment to PIDs and the PID infrastructure: PIDs are an organizational rather than a technical effort

EPIC – European Persistent Identifier Consortium (GWDG Germany, SARA Netherlands, CSC Finland, http://www.pidconsortium.eu/ )

Page 17: Long-term preservation aspects in the eSciDoc project

Resource metadata

Standards and frameworks have to be used so that data are known

● Dublin Core, MODS, others

● Where no standards exist, describe the profile in a standard format: Dublin Core Singapore Framework

Services, metadata quality and metadata extraction

● Metadata validation service

● CoNE – Control of Named Entities service

● JHOVE – technical metadata extraction service

● These support an important aspect: quality of data, clear identification of resources, format migration

Interoperability – data redundancy

● Ensures resources are disseminated via RSS, OAI-PMH

Support for variety of data (publications, old manuscripts, microscopic images, patents, datasets) and their metadata
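As an illustration of dissemination via OAI-PMH, the sketch below builds a harvesting request URL. `ListRecords`, `metadataPrefix`, `from` and `set` are standard OAI-PMH request arguments, and `oai_dc` is the Dublin Core format every OAI-PMH repository must offer. The base URL is a placeholder, not an eSciDoc endpoint, and no network request is made.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_records_url(base_url: str,
                           metadata_prefix: str = "oai_dc",
                           from_date: Optional[str] = None,
                           set_spec: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (no request is sent)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # e.g. "2009-01-01" for selective harvesting
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Placeholder base URL, used only to show the shape of the request.
print(build_list_records_url("https://repository.example.org/oai",
                             from_date="2009-01-01"))
```

A harvester that polls such a URL regularly ends up with a second copy of the metadata, which is how interoperability also contributes to redundancy.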

Page 18: Long-term preservation aspects in the eSciDoc project

Metadata standards and frameworks

“The Singapore Framework for Dublin Core Application Profiles is a framework for designing metadata applications for maximum interoperability and for documenting such applications for maximum reusability”

http://dublincore.org/architecturewiki/SingaporeFramework/

http://dublincore.org/documents/2007/06/04/abstract-model/

Page 19: Long-term preservation aspects in the eSciDoc project

Metadata profiles

● DCAP based (Dublin Core Application Profile)

● DC terms (identified by URIs)

● eSciDoc solution specific terms (identified by URIs)

● METS/MODS

● Publicly available

● Functional description http://colab.mpdl.mpg.de/mediawiki/ESciDoc_Application_Profiles

● Application profiles schemas (xsd) http://metadata.mpdl.mpg.de/escidoc/metadata/schemas/0.1/

● Vocabulary encoding schemas http://colab.mpdl.mpg.de/mediawiki/ESciDoc_Encoding_Schemes

● Interoperability levels

● Shared term definitions (done)

● Semantic interoperability (done)

● Description set syntactic interoperability (started)

● Description set profile interoperability (started)

Page 20: Long-term preservation aspects in the eSciDoc project

Services

• CoNE – Control of Named Entities

• Journals, Persons, Dewey Decimal Classification (3 public levels), Creative Commons Licenses (CC licenses), ISO 639-3 Languages, MIME Types, PACS classification, Custom classifications

• “Vocabulary Encoding schemas for resources”

• Validation service

• Semantic metadata validation based on Schematron

• If the publication is of type “Journal Article” then the “Journal” information has to be present

• Rules are preserved in XML – and are not encoded in the business logic code

• JHOVE service

• Technical metadata are extracted prior to binary content upload

• Technical metadata are kept together with the binary content

• Example

• http://faces.mpdl.mpg.de/details/escidoc:47276

• http://r-coreservice.mpdl.mpg.de/ir/item/escidoc:47276
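The validation rule quoted above (a journal article must carry journal information) and the idea of keeping rules in XML rather than in business logic can be illustrated with a short sketch. The rule format, element names and attribute names here are a deliberately simplified, hypothetical stand-in and not the format used by the eSciDoc validation service.

```python
import xml.etree.ElementTree as ET
from typing import List

# Rules live in data, so they can change without touching the code below.
RULES_XML = """
<rules>
  <rule when-path="genre" when-equals="Journal Article"
        require-path="source/journal"
        message="A journal article must carry journal information"/>
</rules>
"""


def validate(metadata_xml: str, rules_xml: str = RULES_XML) -> List[str]:
    """Return the messages of all rules that the metadata record violates."""
    record = ET.fromstring(metadata_xml)
    messages = []
    for rule in ET.fromstring(rules_xml).iter("rule"):
        # A rule only applies when its trigger element has the given value.
        if record.findtext(rule.get("when-path")) != rule.get("when-equals"):
            continue
        required = record.findtext(rule.get("require-path"), default="")
        if not required.strip():
            messages.append(rule.get("message"))
    return messages


incomplete = "<publication><genre>Journal Article</genre></publication>"
complete = ("<publication><genre>Journal Article</genre>"
            "<source><journal>Example Journal</journal></source></publication>")
print(validate(incomplete))
print(validate(complete))
```

Adding a new semantic check then means adding a `rule` element, not redeploying the service.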

Page 21: Long-term preservation aspects in the eSciDoc project

Data provenance

Support for event logs and version history (what happened with the content?)

PREMIS v1 created for each resource (events info)

Example: http://pubman.mpdl.mpg.de/pubman/item/escidoc:66603

Upcoming: upgrade to PREMIS v2
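To make the event log idea concrete, here is a minimal sketch that builds a PREMIS-style event record. The element names follow the PREMIS event entity, but the output is simplified: it carries no namespace, is not validated against the PREMIS schema, and the identifiers and event type are hypothetical examples rather than values from eSciDoc.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Optional


def premis_event(event_id: str, event_type: str, detail: str,
                 object_id: str, when: Optional[datetime] = None) -> ET.Element:
    """Build a simplified, namespace-free PREMIS-style event element."""
    when = when or datetime.now(timezone.utc)
    event = ET.Element("event")
    identifier = ET.SubElement(event, "eventIdentifier")
    ET.SubElement(identifier, "eventIdentifierType").text = "local"
    ET.SubElement(identifier, "eventIdentifierValue").text = event_id
    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = when.isoformat()
    ET.SubElement(event, "eventDetail").text = detail
    # Link the event to the object (resource version) it happened to.
    linked = ET.SubElement(event, "linkingObjectIdentifier")
    ET.SubElement(linked, "linkingObjectIdentifierType").text = "local"
    ET.SubElement(linked, "linkingObjectIdentifierValue").text = object_id
    return event


# Hypothetical identifiers, used only for illustration.
event = premis_event("event-1", "update", "Metadata corrected",
                     "example:1", datetime(2009, 6, 1, tzinfo=timezone.utc))
print(ET.tostring(event, encoding="unicode"))
```

One such record per change, kept alongside the version history, answers the question of what happened to the content and when.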

Page 22: Long-term preservation aspects in the eSciDoc project

Software, standards, documentation

Software

● All software used is open source

● eSciDoc software is available under the CDDL license (open source)

Documentation

● Accessible (still a lot of work to do)

● Specifications and some design documents are available via MPDL Colab, e.g. http://colab.mpdl.mpg.de/mediawiki/PubMan_About

Standards and formats

● XML, XSD used for interface and service operation messages

● Use of metadata standards

● PREMIS

● Open formats wherever possible, e.g. metadata, lexical resources (TEI-XML) etc.

● Containment: Fedora XML objects for content resources

● OAIS reference model – compatible

Bitstream preservation

Page 23: Long-term preservation aspects in the eSciDoc project

OAIS archive: what will we put into long-term archiving (LTA)?

• XML

• XML/RDF

• XSDs

• Open source code

• Binary content with technical metadata (XML) and calculated MD5 checksums
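The role of the MD5 checksums in bitstream preservation can be sketched as follows: compute a checksum when content is stored, then recompute it later to detect corruption. The function names are illustrative assumptions; only Python's standard `hashlib` is used.

```python
import hashlib
import io
from typing import BinaryIO


def md5_of_stream(stream: BinaryIO, chunk_size: int = 1024 * 1024) -> str:
    """Compute the MD5 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(stream: BinaryIO, recorded_checksum: str) -> bool:
    """Compare a freshly computed checksum with the one recorded at ingest."""
    return md5_of_stream(stream) == recorded_checksum.lower()


# At ingest: record the checksum next to the technical metadata.
original = b"example binary content"
recorded = md5_of_stream(io.BytesIO(original))

# Later: verify a stored copy, and a copy with one byte changed.
print(is_intact(io.BytesIO(original), recorded))
print(is_intact(io.BytesIO(b"Example binary content"), recorded))
```

Reading in chunks keeps memory use constant for large files. MD5 detects accidental corruption, which is the use named on the slide; it is not suitable for protecting against deliberate tampering.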

Page 24: Long-term preservation aspects in the eSciDoc project

Lessons learned

Have we covered all aspects?

● They must be considered from the very start in infrastructural design and implementation

● Things get more complicated in a dynamic environment (data are changed often)

Selection process

● What: all or only selected research data?

● When: whenever data is created, or after the curation activities are finalized?

● Why: “institutional memory”, future research?

Data redundancy

● As many copies as possible

● Share and replicate as much as possible: redundancy does not hurt (it costs)

Organizational commitment

● Sustainable

● Long-term

● Technical

● Collaboration and cooperation with partners is necessary (it's not a single-unit effort)

Page 25: Long-term preservation aspects in the eSciDoc project

Thank you!

www.escidoc-project.de http://colab.mpdl.mpg.de

[email protected]

Page 26: Long-term preservation aspects in the eSciDoc project

Resources

● eSciDoc project web pages, infrastructure download: http://www.escidoc-project.de

● MPDL collaboratory network: http://colab.mpdl.mpg.de

● MPDL software download:

http://escidoc1.escidoc.mpg.de/projects/common_services/

http://escidoc1.escidoc.mpg.de/projects/pubman/

http://escidoc1.escidoc.mpg.de/projects/faces/

http://escidoc1.escidoc.mpg.de/projects/virr/

http://colab.mpdl.mpg.de/mediawiki/ESciDoc_Admin