Digital Preservation
Dale Flecker, Stephen Abrams
February 15, 2007
HUL University Library Council
Agenda
I The problem
II What has Harvard been doing?
III What more do we need to do?
I The problem …
… is twofold
• Keeping the bits
• Keeping the bits useful
Keeping the bits
• Digital things are amazingly easy to destroy!
– Bad guys want to do damage
– Hardware/software fails
– People make mistakes
• A slip of the finger, or an unnoticed consequence of a change, happens easily and is potentially catastrophic
Destruction is not always apparent
Data not used regularly is always at risk of unintended and unnoticed damage.
(Note that archival copies can be pretty invisible…)
Keeping bits useful
Digital materials are fragile!!!
They depend on technologies for their vitality… and those technologies age and disappear rapidly.
Fragility
• Using digital content requires mediation by hardware and software
• Hardware and software must understand the format of the content
• Hardware and software technologies change continually
…
Fragility
• Old technology will break
• New technology frequently does not understand old formats
II What has Harvard been doing?
Internally …
Digital Repository Service (DRS)
• Secure, professionally managed environment
– Manage data rigorously, with discipline, and in accordance with community best practices
• Redundant, heterogeneous, distributed storage with periodic media migration…
Digital Repository Service (DRS)
• Know what data you have
– What are the logical objects (“works”, not files)?
– What are the technical characteristics of those objects?
• Check the data continuously
• Manage access to stored objects
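"Check the data continuously" is commonly done by fixity checking: record a checksum when an object is ingested, then recompute and compare it on a schedule. The following is a minimal sketch, assuming SHA-256 and a plain dictionary as the manifest. The function names and the manifest layout are illustrative assumptions, not a description of how the DRS does it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths):
    """Record one checksum per file at ingest time (hypothetical layout: path -> digest)."""
    return {str(p): sha256_of(Path(p)) for p in paths}


def audit(manifest):
    """Recompute every checksum and return the paths that are missing or changed."""
    damaged = []
    for name, expected in manifest.items():
        path = Path(name)
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged
```

Running `audit` on a schedule over rarely used archival copies is one way to notice the "unnoticed damage" described earlier before the last good copy is gone.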
Format
• Understanding formats is fundamental to preservation
ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...
SOI, APP0 JFIF 1.2, APP13 IPTC, APP2 ICC, DQT, SOF0 183x512, DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
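The step from the raw hex dump to the named segments can be sketched in a few lines. This is an illustrative walker over JPEG marker segments, assuming only the standard layout (a 0xFF marker byte, a marker code, then a two-byte big-endian length that counts itself). The function names and the small marker table are assumptions made for the example, and the sample bytes are just the first 24 bytes of the dump shown on the slide.

```python
import struct

# Marker codes for the segment names that appear on the slide.
MARKER_NAMES = {
    0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13", 0xDB: "DQT",
    0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}


def list_segments(data: bytes):
    """Return (name, declared_length) pairs for the header segments of a JPEG."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    segments = [("SOI", 0)]  # SOI has no length field
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((MARKER_NAMES.get(marker, f"0x{marker:02X}"), length))
        if marker == 0xDA:
            break  # entropy-coded image data follows SOS; stop walking
        pos += 2 + length  # skip the marker plus the declared segment length
    return segments


def jfif_version(data: bytes):
    """Return the JFIF version string if the first segment is a JFIF APP0."""
    if data[2:4] == b"\xff\xe0" and data[6:11] == b"JFIF\x00":
        return f"{data[11]}.{data[12]}"
    return None


# First 24 bytes of the hex dump shown on the slide (the slide itself is truncated).
sample = bytes.fromhex("ffd8ffe000104a46494600010201008300830000ffed0fb0")
```

On `sample`, `list_segments` reports SOI, a 16-byte APP0, and an APP13 segment that declares 4016 bytes, and `jfif_version` returns `"1.2"`, matching the start of the decoded line above. Without knowledge of the format, the same bytes are just an opaque string of hex digits.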
Format
• Formats vary significantly in their “preservability”
• Keeping multiple versions of a given piece of content for different purposes is frequently wise
– E.g. archival master, production master, use copy
Format
• Some criteria for “preservability” (from LC)
– Disclosure (how well documented?)
– Adoption (how widely used?)
– Transparency (is compression used?)
– Self documenting (good!)
– External dependencies (self sufficiency is good)
– Patents (could limit preservation actions)
– DRM/encryption (what if decryption key is not available?)
Metadata
• The basis of decision-making for preservation
– Technical metadata
• What format is this in?
• What format options are used?
– Structural metadata
• If I change this, what else is affected?
…
Metadata
– Administrative metadata
• Who has the right to make decisions about this?
– Relationship metadata
• Are there other versions of this object?
– How do these affect my preservation strategy?
– Provenance metadata
• Where did this come from?
• What changes has it already undergone?
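The metadata categories above can be pictured as one record per object. The following is only a sketch: the class and field names are hypothetical, chosen to map one-to-one onto the questions on the slides, and do not reflect any particular schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvenanceEvent:
    date: str    # when the change happened
    action: str  # what was done to the object


@dataclass
class PreservationRecord:
    object_id: str
    # Technical: what format is this in, and which options are used?
    format_name: str
    format_options: Dict[str, str] = field(default_factory=dict)
    # Structural: if I change this, what else is affected?
    affects: List[str] = field(default_factory=list)
    # Administrative: who has the right to make decisions about this?
    decision_maker: str = ""
    # Relationship: are there other versions of this object?
    other_versions: List[str] = field(default_factory=list)
    # Provenance: where did this come from, and what has changed since?
    source: str = ""
    history: List[ProvenanceEvent] = field(default_factory=list)

    def record_change(self, date: str, action: str) -> None:
        """Append a provenance event so later decisions can see past interventions."""
        self.history.append(ProvenanceEvent(date, action))
```

Keeping the provenance history on the record means each later preservation action can see what has already been done to the object.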
Guidelines for “preservable” objects
The least expensive, and most effective, preservation measure is to think about the future when an object is created!
(Guidelines on format, metadata, archival masters, etc.)
JHOVE (JSTOR/Harvard Object Validation Environment)
A widely used tool for format identification, validation, and characterization.
JHOVE (JSTOR/Harvard Object Validation Environment)
When an object is ingested:
• Determine its format (“identify”)
• Ensure that it is properly formed (“validate”)
• Extract meaningful technical metadata (“characterize”)
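The three ingest steps can be illustrated with a toy example. This sketch is not JHOVE and does not use its API. It assumes only well-known file signatures and the PNG header layout (8-byte signature, then an IHDR chunk with length, type, 13 data bytes, and a CRC), and it validates just the first chunk rather than the whole file. All function names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def identify(data: bytes):
    """Identify: guess the format from the leading signature bytes."""
    if data.startswith(PNG_SIGNATURE):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    return None


def validate_png_header(data: bytes) -> bool:
    """Validate (partially): the first chunk must be a 13-byte IHDR with a correct CRC."""
    if len(data) < 33 or not data.startswith(PNG_SIGNATURE):
        return False
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if length != 13 or chunk_type != b"IHDR":
        return False
    (stored_crc,) = struct.unpack(">I", data[29:33])
    # The CRC covers the chunk type and the chunk data, not the length field.
    return zlib.crc32(data[12:29]) == stored_crc


def characterize_png(data: bytes) -> dict:
    """Characterize: pull basic technical metadata out of the IHDR chunk."""
    width, height, bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return {"width": width, "height": height,
            "bit_depth": bit_depth, "color_type": color_type}
```

The three functions answer different questions: a file can be identified as PNG by its signature and still fail validation, in which case any extracted metadata should not be trusted.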
DRS: what’s managed today
As of January 2007, 5.6M files and 22 TB, excluding Google and web archiving
II What has Harvard been doing?
Externally…
E-journal archiving
• “How can we ensure that licensed e-journal content will remain usable over time?”
• Mellon-funded study
• Explored technical formats, content types, transactions and dataflows, validation, systems requirements, contractual requirements, business models
• Harvard’s proposed model largely implemented by Portico
Technical Metadata for Digital Still Images
• “What are the appropriate technical metadata necessary for the preservation of images?”
• Standardized as NISO Z39.87
• Expressed in the MIX schema
– Maintained by LC
• The basis for DRS image technical metadata
METS (Metadata Encoding and Transmission Standard)
• “Is there a generic packaging form for digital content?”
• For example,
– Digital books
– Audio works
– Images (archival master, production master, deliverables)
• Useful for exchange of objects between repositories
• Maintained by LC
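To make the packaging idea concrete, here is a minimal sketch that builds a skeletal METS document with Python's standard `xml.etree.ElementTree`. It uses the METS namespace and the `fileSec`, `fileGrp`, `file`, `FLocat`, `structMap`, `div`, and `fptr` elements. The helper name, file IDs, USE labels, and paths are made up for illustration, and the output is not validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(label, files):
    """Build a skeletal METS tree; files is a list of (file_id, use, href) tuples."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "logical"})
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div", {"LABEL": label})
    for file_id, use, href in files:
        # One file group per role, e.g. archival master versus deliverable.
        group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": use})
        entry = ET.SubElement(group, f"{{{METS_NS}}}file", {"ID": file_id})
        ET.SubElement(entry, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})
        # The structural map points back at each file by its ID.
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets


# Hypothetical image object with two versions of the same page.
doc = build_mets("Example image", [
    ("F1", "archival master", "file:///masters/page1.tif"),
    ("F2", "deliverable", "file:///deliverables/page1.jpg"),
])
xml_text = ET.tostring(doc, encoding="unicode")
```

The point of the wrapper is that the archival master and its deliverables travel together as one described object, which is what makes exchange between repositories tractable.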
Core audio metadata
• “What are the appropriate technical metadata necessary for the preservation of audio?”
• Standardized as AES X-098
• Used as the basis for DRS audio technical metadata
PDF/A
• “PDF defines too many options; is there a ‘flavor’ that will be more ‘preservable’ over time?”
• Requires, recommends, and restricts PDF functionality to enhance preservability
• Standardized as ISO 19005
PREMIS PREservation Metadata: Implementation Strategies
• “What are the general metadata elements necessary to preserve digital content over time?”
• OCLC/RLG-sponsored work group
• Recommendations and best practices for preservation metadata
– Core elements, data dictionary, implementation strategies, cooperative projects
…
PREMIS PREservation Metadata: Implementation Strategies
• Report on current practices and recommended metadata elements available
• Maintained by LC
AIHT (Archive Ingest and Handling Test)
• “What difficulties can we expect to arise during the exchange of content between heterogeneous repositories?”
• LC-funded project to investigate exchange of complex data between preservation repositories
• Harvard, Stanford, Johns Hopkins, Old Dominion ingest and exchange web archive data
GDFR (Global Digital Format Registry)
• “What will we need to know in the future about formats in use today, and how will we know it?”
• Shared registry of preservation-related information about technical formats
• Reduce work for repositories to create and maintain information about objects they ingest…
GDFR (Global Digital Format Registry)
• Enables sharing of format expertise
• Directed by Harvard, implemented by OCLC
• Funded by Mellon Foundation
Registry of Digital Masters
• “How can I find out who has accepted archival responsibility for a given piece of content?”
• Initially reformatted materials; intention to expand to born-digital
• DLF project
• Implemented by and housed at OCLC
Repository certification
• “Why should a collection manager trust a digital repository?”
• RLG/OCLC report on Trusted Repository Attributes
• RLG/NARA Digital Repository Certification Task Force…
Repository certification
• Recommend structure and metrics of an international process for certifying preservation repositories
– Organizational role and structure, staff size and skill, formal operations and documentation, appropriate technical infrastructure and facilities, on-going funding, and “hand-off” plan, etc.
• CRL Auditing and Certification project
Key activities elsewhere
• ISO 14721 OAIS (Open Archival Information System)
• LC NDIIPP (National Digital Information Infrastructure Preservation Program)
• Web archiving (IA, IIPC)
• NARA ERA (Electronic Records Archiving)
• Digital Curation Centre
• PLANETS
III What more do we need to do?
Evolution: from projects to program
• Digital preservation requires a continual, proactive program
– You can’t just stop and start
– Time frames are MUCH shorter than for preservation of physical collections
• Need to define scope and role of our preservation efforts
• Investment required in both technology and staffing
Preservation lifecycle
• Creation
– Format and technical specification choices
– Accompanying metadata
– Packaging for ingest
• Ingest
– Validation
– Normalization
…
Preservation lifecycle
• Assumption of preservation responsibility
• Monitoring
– When is intervention necessary?
• Changes to the technical environment
• Changes to user expectations
• Planning
– Significant properties
• All preservation decisions involve choice; how to choose what to preserve?
…
Preservation lifecycle
• Intervention (preserving usability)
– Re-acquisition
– Re-generation from an archival master
– Migration before necessary (“just in case”)
– Migration at point of request (“just in time”)
– Emulation of obsolete technology in contemporary environment
– Universal Virtual Computer (UVC)
• Rewrite necessary software to run on technology-agnostic “virtual” computer
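The difference between the two migration strategies is mainly when the conversion cost is paid. The sketch below assumes a hypothetical `convert` function standing in for a real format migration, and an in-memory dictionary standing in for the repository. The names are illustrative only.

```python
from typing import Callable, Dict

Converter = Callable[[bytes], bytes]


def migrate_just_in_case(store: Dict[str, bytes], convert: Converter) -> Dict[str, bytes]:
    """Convert every stored object ahead of need; the whole cost is paid up front."""
    return {object_id: convert(data) for object_id, data in store.items()}


class JustInTimeStore:
    """Convert an object only when it is first requested, then reuse the result."""

    def __init__(self, store: Dict[str, bytes], convert: Converter):
        self._store = store
        self._convert = convert
        self._cache: Dict[str, bytes] = {}
        self.conversions = 0  # how many conversions have actually been run

    def get(self, object_id: str) -> bytes:
        if object_id not in self._cache:
            self._cache[object_id] = self._convert(self._store[object_id])
            self.conversions += 1
        return self._cache[object_id]
```

"Just in case" touches every object whether or not anyone asks for it. "Just in time" defers the work, but it depends on a working converter still being available at the moment of the request.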
…
Preservation lifecycle
• Intervention (continued)
– Save for digital archeologists
• After intervention
– Post-intervention quality assurance
– Documenting the process of change
• Succession planning
– What do we do when we want to get out of the repository business?
Staffing and responsibilities
• Technical
– Infrastructure maintenance
– Monitor technological change
– Integration into larger preservation environment
– Preservation planning
• Curatorial
– Preservation intervention will involve trade-offs
• What attributes need to be preserved?
• Cost/benefit analysis
Immediate challenges
• Google
– Substantial increase in scale (both number and size)
– “Dark” content; no expectation of current access
• Web archiving
– Explosion of data types
– No forethought on format selection and technical specifications
– No metadata
– Some failure may be inevitable
Coming soon?
• Institutional repository (IR) to enhance scholarly communication and preserve scholarly creations
– Similar to web archiving: objects not typically created with preservation in mind, nor accompanied by metadata
• “Just in case” local copies of licensed content
– May necessitate increased sophistication of IPR management
Longer term issues
• Economics – What can we afford to preserve?
• Scale – How much can we preserve?
• Selection – What do we leave for others?
• Federation – Can we share responsibilities for preservation?
– Copies in independent environments are safest
• Certification – Do we need formal certification?
– Note Section 108 revision
• Education – Who at Harvard needs to understand?