HARVARD UNIVERSITY LIBRARY
The Global Digital Format Registry (GDFR) Project
Stephen Abrams, Harvard University
Andreas Stanescu, OCLC
CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007
Digital preservation and format
• Preservation is concerned with ensuring access to managed digital assets over time
• Thus, preservation activities are focused on
– Viability
– Fixity
– Authenticity
– Interpretability
– Renderability
• The last two are primarily a function of format
Without format typing, all content is opaque
ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d80228000100000064000000010003030300000001270f0001000100000000000000000000000060080019019000000000000000000000000000000000000000000000000000000000000000003842494d03ed0a5265736f6c7574696f6e0000000010008313a3000200 ...
SOI
APP0 JFIF 1.2
APP13 IPTC
APP2 ICC
DQT
SOF0 183x512
DRI
DHT
SOS
ECS0
RST0
ECS1
RST1
ECS2
...
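Once the byte stream is typed as JPEG, the leading marker segments can be read directly from the hex dump. The following is a minimal sketch, assuming only the first 24 bytes of the dump; the marker table is a small illustrative subset and the function names are hypothetical.

```python
import struct

# First 24 bytes of the hex dump shown on the slide.
PREFIX = bytes.fromhex("ffd8ffe000104a46494600010201008300830000ffed0fb0")

# A few JPEG marker codes; illustrative, not exhaustive.
MARKER_NAMES = {0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
                0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS"}


def leading_segments(data: bytes):
    """Return (marker name, declared length) for each segment header present in data."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("no SOI marker: content cannot be typed as JPEG")
    segments = [("SOI", 0)]
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segments.append((MARKER_NAMES.get(marker, f"0x{marker:02X}"), length))
        if marker == 0xDA:  # entropy-coded data follows SOS; stop here
            break
        pos += 2 + length  # the declared length includes its own two bytes
    return segments


def jfif_version(data: bytes):
    """Return (major, minor) if the first segment is a JFIF APP0, else None."""
    if data[2:4] == b"\xff\xe0" and data[6:11] == b"JFIF\x00":
        return data[11], data[12]
    return None


print(leading_segments(PREFIX))
print(jfif_version(PREFIX))
```

Without the knowledge that the stream is JPEG, none of these offsets or lengths would have any meaning, which is the point of the slide.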
Global Digital Format Registry
“The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.”
– Centrally-organized collection and review
– Distributed storage, discovery, and delivery on a network of independent, but cooperating registries
What is a format?
• “A serialized encoding of an abstract information model”
• Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level
– IEEE 754 floating point number
– File system
– In both cases, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again
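The IEEE 754 case can be illustrated with a short sketch: the same eight bytes yield the original number only when they are read under the right format. The helper names below are hypothetical.

```python
import struct


def encode(value: float) -> bytes:
    """Map information (a number) to bits using big-endian IEEE 754 double precision."""
    return struct.pack(">d", value)


def decode(bits: bytes) -> float:
    """Map the bits back to information under the same format."""
    (value,) = struct.unpack(">d", bits)
    return value


bits = encode(1.5)
print(bits.hex())                    # 3ff8000000000000
print(decode(bits))                  # 1.5
print(struct.unpack(">q", bits)[0])  # same bits read as a signed integer: not 1.5
```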
What’s wrong with MIME types?
• Non-standardized documentation
• Intended for human, not machine consumption
• Coarse granularity
– image/tiff vs.
  TIFF 4.0 – 6.0
  Baseline Class B, G, P, R
  Extension Class Y
  TIFF/EP
  TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP
  Exif 2.0 – 2.2
  GeoTIFF
  TIFF/FX
  DNG
GDFR project
• Two DLF-sponsored invitational workshops
– University of Pennsylvania, January 2003
– Washington, March 2003
• Two independent demonstration projects
– FRED [John Ockerbloom, University of Pennsylvania] tom.library.upenn.edu/fred/
– FOCUS [Joseph JaJa, University of Maryland] www.umiacs.umd.edu/~joseph/focus-archiving06.pdf
GDFR project
• Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation
• Staffing and technical work subcontracted by HUL to OCLC (July 2006)
GDFR project oversight
• Technical Working Group (TWG)
– Bibliothèque nationale de France
– British Library
– California Digital Library
– Digital Curation Centre, UK
– Library of Congress
– National Archives, UK
– National Archives and Records Administration
– National Library of Australia
– National Library of New Zealand
– Stanford University
– University of Pennsylvania
General development goals
• A generalized registry framework, specialized for the distributed GDFR application
• Based on well-known products and protocols
• Human and machine interfaces
• Full information content is expressible in XML and can be re-instantiated from that expression
• Platform independence
• Globally fault tolerant
• Open source
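The XML goal above amounts to a lossless round trip. A minimal sketch, assuming a three-field record and hypothetical element names (the slides do not show the actual GDFR schema); the version value is illustrative only:

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

FIELDS = ("identifier", "name", "version")  # hypothetical subset of the record


@dataclass
class FormatRecord:
    identifier: str
    name: str
    version: str


def to_xml(record: FormatRecord) -> str:
    """Express the full content of the record as XML."""
    root = ET.Element("format")
    for field in FIELDS:
        ET.SubElement(root, field).text = getattr(record, field)
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> FormatRecord:
    """Re-instantiate the record from its XML expression."""
    root = ET.fromstring(text)
    return FormatRecord(*(root.findtext(field) for field in FIELDS))


record = FormatRecord("info:rfa/gdfr1/Formats/1", "TIFF", "6.0")
print(to_xml(record))
print(from_xml(to_xml(record)) == record)
```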
GDFR data model
• Consistent with PRONOM registry
[Diagram: the Format entity and its related entities]
Format: Identifier, Name, Version, Classification, Description, ReleaseDate, WithdrawalDate, Rights, Signature, Byte order, Grammar, Assessment
Related entities: Media, Agent, Software, Hardware, Document, Relationship
Link labels: dependencies; author, owner, maintainer; documentation
Identifiers
• Canonical, GDFR-assigned identifier
– “info” URI info:rfa/gdfr1/Formats/1
• Other well-known identifiers
– Common name “TIFF”, “Tagged Image File Format”
– MIME type image/tiff
– PRONOM identifier info:pronom/fmt/7
– Library of Congress Format Description Document (FDD) identifier fdd000022
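The identifiers above suggest a simple crosswalk: any well-known identifier resolves to the one canonical GDFR identifier. A minimal sketch using the values from this slide; the table layout and the case-insensitive matching are assumptions, not the registry's documented behavior.

```python
# Canonical GDFR identifier -> other well-known identifiers (values from the slide).
KNOWN_IDENTIFIERS = {
    "info:rfa/gdfr1/Formats/1": [
        "TIFF",
        "Tagged Image File Format",
        "image/tiff",
        "info:pronom/fmt/7",
        "fdd000022",
    ],
}

# Reverse index, lower-cased so lookups are case-insensitive (an assumption).
_INDEX = {
    alias.lower(): canonical
    for canonical, aliases in KNOWN_IDENTIFIERS.items()
    for alias in [canonical, *aliases]
}


def resolve(identifier: str):
    """Return the canonical GDFR identifier for any known alias, or None."""
    return _INDEX.get(identifier.strip().lower())


print(resolve("image/tiff"))
print(resolve("info:pronom/fmt/7"))
print(resolve("unknown"))
```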
Classification scheme
• Eight facets
– Genre (required) text, still-image, sound, aggregate, …
– Role (required) family, file-format, encoding, serialization
– Composition unitary, container-bundle, container-wrapper
– Form binary, text
– Constraint structured, unstructured
– Basis sampled, symbolic
– Domain astronomy, cad-cam, gis, web-archive, …
– Transform compression, encryption, message-digest, …
Classification scheme
• Examples
– TIFF (Tagged Image File Format)
  genre:still-image role:family composition:container-wrapper form:binary basis:sampled
– LZW (Lempel-Ziv-Welch)
  genre:still-image role:encoding transform:compression
– SVG (Scalable Vector Graphics)
  genre:still-image role:file-format form:text basis:symbolic
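A classification of this kind can be checked mechanically against the eight facets. The sketch below assumes a space-separated `facet:value` syntax and treats the facets whose value lists end in an ellipsis on the slide (genre, domain, transform) as open; both choices are assumptions.

```python
# Facet -> allowed values; None marks a facet whose value list is open-ended on the slide.
SCHEME = {
    "genre": None,
    "role": {"family", "file-format", "encoding", "serialization"},
    "composition": {"unitary", "container-bundle", "container-wrapper"},
    "form": {"binary", "text"},
    "constraint": {"structured", "unstructured"},
    "basis": {"sampled", "symbolic"},
    "domain": None,
    "transform": None,
}
REQUIRED = ("genre", "role")


def parse_classification(text: str) -> dict:
    """Parse 'facet:value' tokens and validate them against SCHEME."""
    facets = {}
    for token in text.split():
        facet, sep, value = token.partition(":")
        if not sep or facet not in SCHEME:
            raise ValueError(f"unknown facet in {token!r}")
        allowed = SCHEME[facet]
        if allowed is not None and value not in allowed:
            raise ValueError(f"invalid value for {facet}: {value!r}")
        facets[facet] = value
    missing = [facet for facet in REQUIRED if facet not in facets]
    if missing:
        raise ValueError(f"missing required facets: {missing}")
    return facets


tiff = parse_classification(
    "genre:still-image role:family composition:container-wrapper form:binary basis:sampled"
)
print(tiff)
```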
Signatures
• External signatures
– File extension
– Mac OS type
– Mac OS X Uniform Type Identifiers (UTI)
• Internal signatures
– “Magic numbers”
– Required vs. optional
– Fixed vs. restricted vs. unrestricted
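The two kinds of signature can disagree, which is why a registry records both. A minimal sketch, assuming a hypothetical in-memory signature table with fixed-offset magic numbers only (the TIFF and JPEG byte values are the well-known ones; the table layout is not the GDFR representation):

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class InternalSignature:
    offset: int   # fixed position of the magic number
    magic: bytes


# Hypothetical signature table for two formats.
SIGNATURES = {
    "TIFF": {
        "extensions": {".tif", ".tiff"},
        "internal": [InternalSignature(0, b"II*\x00"), InternalSignature(0, b"MM\x00*")],
    },
    "JPEG": {
        "extensions": {".jpg", ".jpeg"},
        "internal": [InternalSignature(0, b"\xff\xd8\xff")],
    },
}


def identify(filename: str, head: bytes) -> dict:
    """Map each candidate format to the evidence found: 'internal', 'external', or 'both'."""
    extension = PurePath(filename).suffix.lower()
    evidence = {}
    for name, signature in SIGNATURES.items():
        external = extension in signature["extensions"]
        internal = any(
            head[s.offset:s.offset + len(s.magic)] == s.magic
            for s in signature["internal"]
        )
        if internal and external:
            evidence[name] = "both"
        elif internal:
            evidence[name] = "internal"
        elif external:
            evidence[name] = "external"
    return evidence


print(identify("scan.tif", b"II*\x00\x08\x00\x00\x00"))
print(identify("scan.tif", b"\xff\xd8\xff\xe0"))  # extension and content disagree
```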
Grammar
• Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation
– BNF Backus-Naur Form
– BSDL MPEG-21 Bitstream Syntax Description Language
– DFDL Data Format Description Language
– EAST CCSDS 644.0-B-2
– XCEL Extensible Characterisation Extraction Language
Assessment
• Assessment of a format, expressed in some formal typed notation
– Cornell Virtual Remote Control (VRC)
– DSTC PANIC
– Library of Congress Sustainability, Quality, Function (SQF)
– National Library of Australia AONS
– OCLC INFORM
Documentation
• Specification documents (and software files) can be managed and distributed in the network
– Applicable only in cases of public domain resources or if explicit permission is granted by rights holders
– Other documents (and software) will be referenced by full citation, including actionable links where possible
– Mechanism for individuals or institutions to register locally-held copies, with terms of use
Software
• Format role Input, output
• Process type Characterize, create, edit, identify, …
• Enables discovery of transformative processing chains
[Diagram: PDF → (Transform: pdf2ps) → PostScript → (Transform: ps2ascii) → ASCII → (Render: Notepad)]
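Discovery of a processing chain is a path search over software records, each linking an input format to an output format. A minimal sketch using breadth-first search over the two transforms named on the slide; the tuple layout is a hypothetical stand-in for registry software records.

```python
from collections import deque

# (process type, tool, input format, output format)
TOOLS = [
    ("transform", "pdf2ps", "PDF", "PostScript"),
    ("transform", "ps2ascii", "PostScript", "ASCII"),
]


def find_chain(source: str, target: str):
    """Return the shortest list of tools converting source to target, or None."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for _process, tool, input_format, output_format in TOOLS:
            if input_format == fmt and output_format not in seen:
                seen.add(output_format)
                queue.append((output_format, chain + [tool]))
    return None


print(find_chain("PDF", "ASCII"))
print(find_chain("ASCII", "PDF"))
```

Breadth-first search returns the chain with the fewest steps; a registry holding assessment data could instead weight each step by expected quality loss.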
Relationships
• Modification BWF → WAVE
– Extension DNG → TIFF 6.0
– Restriction PDF/A → PDF 1.4
• Definition NITF → XML DTD
• Requisite XML → Relax NG
• Containment ZIP → *
• Equivalence DXF (ASCII) → DXF (binary)
• Version Word 97 → Word 6.0
• Affinity SPIFF → JPEG
GDFR node
• Based on the OCLC IWSA / RFA framework
[Diagram: GDFR node architecture]
Layers: Public service layer, Canonical service layer, Collection layer, Storage layer
Protocols: SRU/W, OAI, SRU Update, RSS, Atom, over TCP/IP
Operations: Create, Add, Delete, Update, Search
Functions: Display, Admin, Data, Content, History, Export, Import
Storage: XML, RDBMS
GDFR node
• Java, Apache/Tomcat, Berkeley DB XML
• GNU LGPL license
– Including pre-existing OCLC technology and technology newly-developed for the project
• Release schedule
– v0.1 (alpha) March 23, 2007
– v0.1 (beta) June 14, 2007
– v1.0 June 30, 2007
– v1.1 August 12, 2007
– v1.3 September 17, 2007
– v1.3.1 October 26, 2007
GDFR network
• Peer-to-peer network of independent, but cooperating registries communicating over a common protocol
[Diagram: a root GDFR node and three peer GDFR nodes. Labels: Editorial process; Submissions for technical vetting; Vetted for propagation; GDFR protocol; Data propagation]
GDFR network
• Public notification of the availability of new data
– RSS feed available at well-known public address to which remote nodes can subscribe
• Remote harvesting of local data
– OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting)
• Initially, a single source (root node) for all new data
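The notify-then-harvest pattern above can be sketched from the point of view of a remote node: read the RSS feed for new items, then issue an OAI-PMH `ListRecords` request. The base URL, the sample feed, and the function names are hypothetical; `verb`, `metadataPrefix`, and `from` are standard OAI-PMH request arguments.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://registry.example.org/oai"  # hypothetical root-node endpoint

# Hypothetical RSS 2.0 feed announcing one new record.
SAMPLE_FEED = """<rss version="2.0"><channel><title>New format records</title>
<item><title>TIFF</title><link>https://registry.example.org/formats/1</link></item>
</channel></rss>"""


def new_record_links(feed_xml: str) -> list:
    """Return the link of every item announced in an RSS feed."""
    root = ET.fromstring(feed_xml)
    return [item.findtext("link") for item in root.iter("item")]


def list_records_url(base_url: str, since=None, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request, optionally limited to records changed since a date."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since:
        params["from"] = since
    return f"{base_url}?{urlencode(params)}"


print(new_record_links(SAMPLE_FEED))
print(list_records_url(BASE_URL, since="2007-10-26"))
```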
Project status
• Extensive internal testing of GDFR software in a stand-alone mode
• Current project activities are focused on
– Implementing the distribution and synchronization functions
– Building the network
– Data acquisition
– Succession planning
Initial population
• Manual addition is possible, but time consuming
• Automated update using Atom
• What sources are available for bulk population?
– PRONOM registry www.nationalarchives.gov.uk/pronom
– Library of Congress Format Description Documents (FDD) www.digitalpreservation.gov/formats/fdd/descriptions.shtml
– Unix / Linux magic(4) database
Subsequent population
• RFC 2026, Internet Standards Process www.ietf.org/rfc/rfc2026.txt
– “Iterations of review by the ... community and revision based upon experience”
• Draft distribution and public discussion
• Approval by “area” editors
• Release to the network for distribution
Sustainability
• The technological solution is the (relatively) easy part, but…
– The technology is expendable
– The important point is for the data to survive, evolve, and expand
Governance and succession
• Mellon funding was for technical work only
• At the end of the two year project…
– Harvard will continue maintenance for up to two years
– Library of Congress has agreed to be a caretaker agency until a permanent body is identified
Governance and succession
• NARA GDFR governance investigation
– Part of the Electronic Records Archives (ERA) initiative
– GDFR Governance Workshop, November 2007
  • Bibliothèque et Archives, Canada
  • Corp. for National Research Initiatives
  • Digital Curation Centre, UK
  • Digital Library Federation
  • General Services Administration
  • Georgia Institute of Technology
  • Government Printing Office
  • Harvard University
  • IBM Watson Research Center
  • Koninklijke Bibliotheek, Netherlands
  • Library of Congress
  • MIT
  • NARA
  • NASA
  • NIST
  • National Library of Australia
  • National Library of New Zealand
  • San Diego Supercomputer Center
  • Stanford University
  • Statens Archiv, Sweden
  • Tessella Support Services
  • University of Pennsylvania
Administrative considerations
• Policy
– Who (and how many) can join the network?
– What are the eligibility requirements?
– What are the rights and obligations of membership?
• Technical
– Who will maintain and enhance the data model?
– Who will maintain, enhance, distribute, and support the software?
Administrative considerations
• Data
– Who will contribute data?
– Who will vouch for data authenticity?
– Who will ensure data integrity?
• Financial
– What are the real human and system costs associated with GDFR operation?
– Who pays, and how?
Summary
• The GDFR is an enabling technology that will support digital repository and preservation activities
– Supports the strong typing of digital assets at an appropriate level of granularity
– Enables the future recovery of the syntax and semantics associated with typed digital assets
– A means to pool and redistribute the expertise of the international digital preservation community
For more information…
www.formatregistry.org