Digital Preservation Workshop
5 November 2010, University of Calgary
Peter Van Garderen, Artefactual Systems
Workshop Agenda
10:00 Introductions
10:15 What is digital preservation?
From strategy to implementation
Archivematica technical overview
12:00 Lunch
13:30 Free and open source software
Archivematica & ICA-AtoM demo/tutorial
Preservation planning (time permitting)
16:00 Wrap-up
NOTE: open discussion / Q&A encouraged throughout
The content in this presentation may be freely re-used under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 license.
Attribution:
Title: Archivematica: Digital Preservation Workshop
Creator: Peter Van Garderen, Artefactual Systems
Date: November 5, 2010
© Artefactual Systems Inc. 2010
Peter Van Garderen, President / Systems Archivist
Evelyn McLellan, Systems Archivist
Jack Bates, Software Engineer
David Juhasz, Software Engineer
Austin Trask, Systems Engineer
Jesús García Crespo, Software Engineer
Joseph Perry, Software Engineer
open-source software for archives and libraries
digital preservation consulting services
http://artefactual.com
Digital Preservation: planning for the long-term accessibility and usability of authentic digital information
a.k.a. digital curation
digital continuity
The Digital Preservation Problem:
Fragility of digital storage media
Lack or loss of adequate metadata
Lack of responsibility and resources
The complexity of digital information
The volume of digital information
Rapid technological change
[Diagram: digital information (content, context, structure, presentation, behaviour) encoded as binary data (1010 1000 1011 1101); labels: intellectual entity, object entity (representation, file, bitstream)]
Information ↔ Record ↔ Archival Material
now future
bitstream
header information
storage media
package
storage device
storage driver
file system
error correction
operating system
application software
user interface
input / output devices
metadata
find
relate / bind
authenticate
contextualize
stored
conserved
protected
Digital Preservation is Risk Management
Risk
Inability to provide services, manage programs and operate business functions efficiently because of digital information that is not accessible or usable.
Risk
Poor quality decision-making because digital information that would have been otherwise available has disappeared or can’t be trusted to be authentic.
Risk
Exposure to legal liability because the digital information that serves as evidence of the organization’s compliance or accountability in its contractual, governance and administrative obligations has been lost or can’t be trusted to be authentic.
Risk
Heightened risk of non-compliance with laws, regulations and policies due to inaccessible electronic records.
Risk
The organization acquires a reputation as an irresponsible, incompetent or untrustworthy institution.
Risk
Lost opportunities to re-use and exploit information in digital form.
Risk
Unforeseen cost-creep because the ongoing preservation of digital information is overlooked during the calculation of costs for new or modified systems.
Risk
Corporate memory loss and ‘cultural amnesia’ as the digital information that documents the governance, administration and culture of an organization or society disappears from servers and systems before steps have been taken to preserve it.
The Business Case for Digital Preservation
• Manage risks
• Manage information
• Manage storage
The anti-Business Case for Digital Preservation
• “don’t we already have backup and a business continuity plan?”
• “don’t we just upgrade the software?”
• “storage is cheap”
• “we’ll just index everything”
• “why can’t we use the ERDMS/ECM system for this?”
ERDMS / ECM
Digital Archives
Staff Desktops (email, docs, files)
Business Systems (structured data)
Staff External Researchers
active documents
inactive documents
Legacy Systems & Data
Website(s)
Scanning / Imaging
Individual/Ad Hoc Accessions
capture
capture
transfer
destroy
access access access
store
store
organize
preserve
organize
archival material
E-Record Creating Environment
Category 1: preserve in source system
Category 2: transfer to digital archives
ERP, DAM, Data Warehouse
??
records schedule
ERDMS, Shared Drives, Email ??
What are the requirements for a digital preservation system?
...that depends, what kind of future were you expecting?
now future?
Digital Preservation Strategies
technology preservation
emulation
migration
normalization
Digital Preservation System Core Requirements
OCLC/NARA Trustworthy Repositories: Audit & Certification (TRAC) www.crl.edu/sites/default/files/attachments/pages/trac_0.pdf
ISO 14721 – Open Archival Information System (OAIS) http://public.ccsds.org/publications/archive/650x0b1.pdf
PREMIS Data Dictionary for Preservation Metadata www.loc.gov/standards/premis/v2/premis-2-0.pdf
[Diagram: OAIS functional model. Functional entities: Ingest, Data Management, Archival Storage, Access, Preservation Planning, Administration. Information packages: SIP, AIP, DIP. Actors: Producer, Management, Consumer.]
Open Archival Information System
Open Archival Information System
● ISO 14721
● High level reference model
● Default language of the digital preservation world
● Key concepts:
  ● Mandatory Responsibilities
  ● Functional Entities
  ● Information Packages
  ● Actors
OAIS Solutions
Proprietary
Safety Deposit Box – www.tessella.com
Rosetta – www.exlibrisgroup.com
Open Source
RODA – http://roda.di.uminho.pt/
Archivematica – http://archivematica.org
OAIS Solutions vs. Digital Preservation Tools
Plato, Planets Testbed, DROID, PRONOM, JHOVE, Fedora, Xena, Dioscuri
OAIS Solutions vs Digital Preservation Initiatives
Planets, CASPAR, InterPARES, NDIIPP
Business Case Opportunities
ERDMS, ECM implementation
Enterprise search implementation
Business process/records scheduling analysis
Archiving and storage pressure
Audits
FOI and disclosure/transparency initiatives
Response to digital preservation strategy
Inter-institutional partnerships
Project Milestones
Strategy, Business Case, Technical Analysis, Proof-of-concept, Pilot(s), Production
System maintenance, System scope
Identifying Pilot Projects
Scoring Factors: Technical difficulty, Intrinsic value, Obsolescence risk, Transparency risk, Breadth of collection, Descriptive metadata
Pilot Candidates: Institutional Repository, Email, ERDMS, Shared directories, External media
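One way to compare pilot candidates against the scoring factors is a simple weighted sum. The sketch below is illustrative only: the 1 to 5 rating scale, the weights (including the negative weight on technical difficulty) and the function names are assumptions, not part of the workshop material.

```python
# Factor names taken from the slide; everything else is hypothetical.
FACTORS = (
    "technical_difficulty",
    "intrinsic_value",
    "obsolescence_risk",
    "transparency_risk",
    "breadth_of_collection",
    "descriptive_metadata",
)


def score_candidate(ratings: dict, weights: dict) -> float:
    """Weighted sum of a candidate's ratings; unlisted factors get weight 1.0."""
    missing = [factor for factor in FACTORS if factor not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    return float(sum(weights.get(f, 1.0) * ratings[f] for f in FACTORS))


def rank_candidates(candidates: dict, weights: dict) -> list:
    """Return candidate names ordered from highest to lowest score."""
    return sorted(
        candidates,
        key=lambda name: score_candidate(candidates[name], weights),
        reverse=True,
    )


if __name__ == "__main__":
    # Assumption: a harder pilot is less attractive, so difficulty counts against it.
    weights = {"technical_difficulty": -1.0}
    candidates = {
        "Email": dict.fromkeys(FACTORS, 3) | {"technical_difficulty": 5},
        "Shared directories": dict.fromkeys(FACTORS, 3) | {"technical_difficulty": 2},
    }
    for name in rank_candidates(candidates, weights):
        print(name, score_candidate(candidates[name], weights))
```

The ratings themselves remain a judgment call for each institution; the code only makes the comparison repeatable.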
Project Costs
Software licenses (proprietary)
Software installation: Application server, Security, Storage integration
Software customization: Legacy/source system migrations, System-specific ingest/transfer templates, Access system integration
Annual maintenance: Technical support, Release upgrades
Staff
Hardware
Storage
What is Archivematica?
Archivematica is a comprehensive digital preservation system.
Archivematica uses a micro-services design pattern to provide an integrated suite of free and open-source tools that allows users to process digital objects from ingest to access in compliance with the ISO-OAIS functional model.
Archivematica uses METS, PREMIS, Dublin Core and other best practice metadata standards.
Archivematica implements media type preservation plans based on an analysis of the significant characteristics of file formats.
Where did Archivematica come from?
● Artefactual Systems
● City of Vancouver Archives
● UNESCO Memory of the World
● International Monetary Fund Archives
● Rockefeller Archives Center
● University of British Columbia Library
● ?
[Diagram: OAIS functional model. Functional entities: Ingest, Data Management, Archival Storage, Access, Preservation Planning, Administration. Information packages: SIP, AIP, DIP. Actors: Producer, Management, Consumer.]
Open Archival Information System
[Diagram: digital archives software system requirements, derived from ISO-OAIS, OAIS Use Cases and UML Activity Diagrams; documentation provided as System Workflow Instructions at http://archivematica.org/docs]
Producer places SIP in shared folder on host machine
[host]/sendSIP/
The producer places a folder of objects in a designated folder on his or her computer. This designated folder has been set up so that it automatically sends its contents to a shared folder in Archivematica.
SIP appears in shared folder in Archivematica
/1-receiveSIP/
The shared folder in Archivematica is 1-receiveSIP. When you are processing the SIP, leave the original copy in this folder as a backup in case you need to go back and start again.
Archivist copies SIP to SIP review folder
/2-reviewSIP/
Archivist reviews SIP /2-reviewSIP/
Check the SIP to make sure it conforms to the Submission Agreement. If an MD5checksum.txt file is included, right-click it and select Verify MD5 Checksum. Otherwise, Archivematica will add checksums to the SIP logs directory and verify checksums at various times throughout the ingest process.
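The manual checksum check can also be scripted. This is a minimal sketch, not Archivematica's own code: it assumes the MD5checksum.txt file uses the common md5sum layout (hex digest, whitespace, relative file name), and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(sip_dir: Path, manifest_name: str = "MD5checksum.txt") -> dict:
    """Compare every manifest entry with the file on disk.

    Assumed line layout: "<hex digest>  <relative path>".
    Returns {relative path: bool}; a missing file counts as a failure.
    """
    results = {}
    manifest = sip_dir / manifest_name
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with "*"
        target = sip_dir / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results


if __name__ == "__main__":
    # Hypothetical SIP location, following the folder names used in the slides.
    for name, ok in verify_manifest(Path("2-reviewSIP/mySIP")).items():
        print("OK  " if ok else "FAIL", name)
```

Any entry reported as FAIL means the object changed, or went missing, between the producer's machine and the review folder.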
Archivist adds descriptive metadata
/2-reviewSIP/
Open the SIP. Right-click and select Add Dublin Core XML from the drop-down menu. Right-click the dublincore.xml file to open it with Mousepad. Add descriptive information to the appropriate Dublin Core elements and save the file.
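For SIPs that arrive with descriptive information already in a spreadsheet or database, the same file could be generated rather than typed by hand. A minimal sketch using the 15 Dublin Core elements and their standard namespace; the root element name and overall layout of Archivematica's dublincore.xml template are not given here, so the "metadata" wrapper and the function names are assumptions.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

ET.register_namespace("dc", DC_NS)


def build_dublin_core(fields: dict) -> ET.Element:
    """Build an XML tree with one dc:* child per supplied field."""
    root = ET.Element("metadata")  # assumed wrapper element
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


def write_dublin_core(fields: dict, path: str) -> None:
    """Serialize the Dublin Core record to a UTF-8 XML file."""
    tree = ET.ElementTree(build_dublin_core(fields))
    tree.write(path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    record = {"title": "Council minutes", "creator": "City Clerk", "date": "2010-11-05"}
    print(ET.tostring(build_dublin_core(record), encoding="unicode"))
```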
Archivist moves SIP to quarantine
/3-quarantineSIP/
= manual step = automated step
= file directory
Agile development method
● Time-based system releases
  ● Feb 2009: Release 0.1-alpha
  ● May 2010: Release 0.6-alpha
  ● November 2010: Release 0.7-alpha
● Each iteration leads to updated and improved:
  ● Requirements
  ● Software
  ● Documentation
  ● Development resources
http://archivematica.org/software
micro-service
watched input directory
success output directory
error output directory
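The watched-directory pattern in this diagram can be sketched as a single pass over an input folder: each item is handed to a task and then moved to the success or error folder. This is an illustration under stated assumptions, not Archivematica's implementation; a real service would watch the folder continuously, and the function and parameter names are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable


def run_micro_service(
    watched: Path,
    success: Path,
    error: Path,
    task: Callable[[Path], None],
) -> dict:
    """Process every item in the watched directory once.

    The task signals failure by raising an exception. Returns a mapping of
    item name to the name of the directory it was moved to.
    """
    outcome = {}
    # Materialize the listing first so moving items does not disturb iteration.
    for item in sorted(watched.iterdir()):
        try:
            task(item)
            destination = success
        except Exception:
            destination = error
        destination.mkdir(parents=True, exist_ok=True)
        shutil.move(str(item), str(destination / item.name))
        outcome[item.name] = destination.name
    return outcome


def reject_empty(path: Path) -> None:
    """Example task: fail on zero-byte files."""
    if path.is_file() and path.stat().st_size == 0:
        raise ValueError(f"{path.name} is empty")
```

Because each service only reads one folder and writes to two, services can be chained by making one service's success folder the next service's watched folder.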
Archivematica is:
A classic Unix pipeline of OAIS micro-services provided by a series of open-source tools and integration code written in Python and Bash.
Packaged as a virtual appliance that bundles the Xubuntu operating system and can be run within virtual machines, as a bootable USB or Live DVD, or as a bare metal install on dedicated machines.
Free Beer!
“They’ll never take our freedom”
© 1995 Paramount Pictures & 20th Century Fox
See fair use rationale: http://en.wikipedia.org/wiki/File:Brave_mel.jpg
Free Puppy!
Foundation or Steering Committee
Governance
Coordination
Funding
Promotion
Users
Lead institutions: Funding, Development
All users: Bug reports, Enhancement requests, Code patches, Documentation, Promotion
Open Source Software
Code
Knowledge
Community
Service Providers
Development
Technical Support
Hosting
Training
Promotion
[Diagram: flows of code, time, money and knowledge between these groups]
The open-source eco-system
Preservation planning: Normalizing file formats
Defining normalization
What is it?
Normalization means converting ingested objects into a small number of pre-selected formats
Why do it?
Some formats are easier to preserve than others
A smaller number of formats means fewer preservation actions required
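The idea of mapping many incoming formats onto a few pre-selected ones can be shown with a small lookup table. The format pairs below are hypothetical examples chosen for illustration, not Archivematica's actual preservation plans, and the sketch only plans the conversions; it does not perform them.

```python
from pathlib import Path

# Hypothetical plan: source extension -> preservation extension.
NORMALIZATION_PLAN = {
    ".bmp": ".tif",
    ".jpg": ".tif",
    ".mp3": ".wav",
    ".doc": ".odt",
}


def plan_normalization(filenames: list, plan: dict = NORMALIZATION_PLAN) -> dict:
    """Map each ingested file name to its planned preservation copy.

    Files whose extension is not in the plan map to None, meaning no
    normalization rule exists and the original is kept as-is.
    """
    planned = {}
    for name in filenames:
        path = Path(name)
        target = plan.get(path.suffix.lower())
        planned[name] = str(path.with_suffix(target)) if target else None
    return planned


if __name__ == "__main__":
    ingested = ["photo.JPG", "scan.bmp", "interview.mp3", "notes.xyz"]
    for original, copy in plan_normalization(ingested).items():
        print(original, "->", copy or "no rule, keep original")
```

In this example four source formats collapse into three target formats, which is the reduction in future preservation actions that the slide describes.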
Normalization vs. migration
Migration is similar to normalization in that it involves converting ingested objects into preservation-friendly formats
Unlike normalization, migration is typically done only when the format is at risk of obsolescence
Migration as a strategy means adopting a wait and see approach
Disadvantages of normalization
It requires more planning up front to implement
Re-normalization may be required as better target formats or conversion tools become available
Advantages of normalization
Taking preservation action on ingest helps define and manage risk
Adopting a wait and see approach means putting off an undefined amount of work for an indefinite period of time at an unknown cost
Normalization does not preclude the future use of migration or other strategies such as emulation
Criteria for choosing formats
1. The format must be non-proprietary
− There must be no associated licenses or patents or the possibility of there being such licenses or patents in the future
2. There must be freely available specifications
− A specification is a document that explains exactly how the format is structured and rendered
− This specification must be freely available to all and not subject to copyright or other restrictions
3. The format should be widely endorsed and/or adopted
− Other established repositories should be using or have endorsed the format
− Formats that have been approved as international standards are particularly desirable
4. For images and audio files there should be no compression
5. For video files any compression should be lossless
6. There should be writing and rendering tools available for the format
− Idealized standards must be matched by practical tools
− The tools must reliably meet the requirements of the format specifications and must produce normalized objects that are faithful representations of the original objects
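The six criteria can be applied as a checklist when assessing a candidate format. A minimal sketch, assuming each format has already been assessed by hand; the FormatProfile fields and the example profile are hypothetical and describe no real format.

```python
from dataclasses import dataclass


@dataclass
class FormatProfile:
    """Hand-assessed facts about a candidate preservation format."""
    name: str
    media_type: str           # "image", "audio", "video" or "other"
    proprietary: bool
    open_specification: bool
    widely_adopted: bool
    compression: str          # "none", "lossless" or "lossy"
    tools_available: bool


def unmet_criteria(fmt: FormatProfile) -> list:
    """Return the numbers (1-6) of the criteria the format fails."""
    failed = []
    if fmt.proprietary:
        failed.append(1)  # must be non-proprietary
    if not fmt.open_specification:
        failed.append(2)  # specification must be freely available
    if not fmt.widely_adopted:
        failed.append(3)  # should be widely endorsed and/or adopted
    if fmt.media_type in ("image", "audio") and fmt.compression != "none":
        failed.append(4)  # images and audio: no compression
    if fmt.media_type == "video" and fmt.compression == "lossy":
        failed.append(5)  # video: any compression must be lossless
    if not fmt.tools_available:
        failed.append(6)  # writing and rendering tools must exist
    return failed


if __name__ == "__main__":
    candidate = FormatProfile(
        name="ExampleLossyVideo", media_type="video", proprietary=True,
        open_specification=True, widely_adopted=False,
        compression="lossy", tools_available=True,
    )
    print(candidate.name, "fails criteria", unmet_criteria(candidate))
```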
http://archivematica.org