Agenda, Day 2 · Stanford University Libraries May 2011
Agenda, Day 2
08:30 – 08:35 Review of objectives and agenda
08:35 – 09:30 Infrastructure and tools
09:30 – 10:30 Case study: preservation activities at CDL
10:30 – 11:00 Morning break
11:00 – 12:00 Case study: preservation activities at Portico
12:00 – 12:30 Preservation initiatives and organizations: DataNet, DCC, DPC, IIPC, NDSA, OPF
12:30 – 14:00 Lunch
14:00 – Case study: preservation activities at Stanford
– 15:00 Other preservation resources
15:00 – 15:30 Afternoon break
15:30 – 16:00 Format characterization
16:00 – 16:30 Characterization in preservation workflows
16:30 – 17:00 Questions and discussion
Digital Preservation at Stanford University
Tom Cramer
Chief Technology Strategist
Stanford University Libraries
May 2011
Agenda
• Stanford University
• First & Second Generation Digital Library
• Digitization Efforts
• The Stanford Digital Repository
– Preservation Core
– Management
– Access
Stanford University
“The University of Stanford”?
Leland Stanford Junior University
Stanford University
• 15,000 students
– 8,000 graduate
– 7,000 undergraduate
• 2,000 faculty
• 35,000 total university community
• $3.4 billion annual operating budget
• $17.2 billion endowment
• Roots of Silicon Valley
• One of the world’s leading research universities
Stanford’s Digital Library c. 2007
Typical of all first generation digital libraries?
1st Generation Digital Libraries
• Small scale digitization, largely focused on text & images
• Purpose built systems for specific content types – application focus
• Highly theoretical approach to digital preservation
• Anemic UI’s
2nd Generation Digital Libraries
• Large scale digitization
• With more content types
• Multi-pathway workflows
• Content use & reuse in an integrated environment
• Pragmatic approach to digital preservation & full lifecycle of objects
• Infrastructure & service focus
Digitization Trends -- Drivers
• Boutique → Large scale
• Text & image → text, image, audio, video, software and more
• Refresh of 1st generation delivery systems with contemporary UI’s
Digitization Trends -- Responses
Replacing individual, handwrought schemes with workflow-based systems, largely automated, with QA, exception handling and reporting that work for multiple content streams.
Management of the full lifecycle of an object, from physical object management through capture, preservation & access
Digitization at SULAIR
1. Robotic Book Scanning Lab
2. Rare Book Scanning Lab
3. Map Scanning Lab
4. High End Imaging Lab
5. Multipurpose (Sheet Feed, et al) Lab
6. Media Preservation Lab
7. Digital Forensics Lab
Stanford’s Legacy Media Counts
More than 20,000 handheld media objects in Special Collections alone
Legacy Media & Digital Forensics
• Files, operating systems & software
• mss, correspondence, images, records, data, etc.
• Steps:
– Extraction
– Forensic analysis
– Archival processing & description
– Access & emulation
• Paradigm shift for archivists, donors
Lifecycle Management = Integration
Digitization & file processing are the easiest parts of any digitization initiative. Description, file management, collection management, access, and a holistic workflow uniting all pieces are the real challenge.
Preservation at Stanford
• SDR has been in production since Dec 2006
• Now a second generation preservation system
• One component in a larger ecosystem of digital library infrastructure
[Figure: SDR timeline. 1997: need identified, “Dark Cave” concept. ’02–’07: NDIIPP prototype, redesign, 1.0 in prod. ’08–’09: 2.0 conceived. ’10: 2.0 in prod.]
Three Major Areas of Preservation Needs
• Digital Library
– Legacy collections
– Digitized collections
– Licensed, locally loaded content
– Born digital collections
• Institutional Repository
– Research data
– Publications, dissertations
– Learning objects, university assets
• External Depositors
– Publishers
– Discipline-specific repositories
– Reciprocal deposits with peer institutions
Google Books (’00s of TB)
Manuscripts (75 TB)
Media (50 TB)
Geospatial Data (10 TB)
~30 other digi projects (15 TB)
Purchased collections (25 TB)
E.g., Google-Scanned Books
Download, process and preserve 8 million volumes in SDR for...
• local indexing,
• text mining,
• selective delivery, and
• long-term access.
E.g., Monterey Jazz Festival
•Festival founded in 1958: longest running jazz festival in the world.
•Rich collection of recordings from inception, spanning over 50 years, in varying states of condition & decay.
•Archives held at Stanford’s Archive of Recorded Sound
•~800 audio recordings, 1.6 TB audio files in SDR
•~250 video recordings, 22 TB video files in SDR
Access:
- Complete database of digital recordings online at collections.stanford.edu/mjf
- Access via on-site visit to ARS
- New commercial releases on MJF Records
E.g., National Geospatial Digital Archive
• Some 27,000 “at risk” geospatial objects
• TIFFs, GeoTIFFs, Shapefiles, Digital Elevation Models, Digital Orthophoto Quadrangle files
E.g., Preserving Virtual Worlds
Stanford University Libraries Second Life Open House, 31 July 2009
E.g., Forensically Extracted Born Digital Files
•Digital Forensics lab extracting original computer files from legacy media
•Actively building pipeline from extraction to preservation store
•Support for both immediate and deferred archival processing & description
E.g., Electronic Theses and Dissertations
NSF Policy Position on Data Archiving 1
“NSF's policy position on data is straightforward:
”
1 National Science Foundation, Cyberinfrastructure Council. Cyberinfrastructure Vision for 21st Century Discovery. March, 2007.
NSF and NIH Grants to Stanford
SDR 2.0 today
• 100+ TB of unique content
• 300+ TB of managed data
• 200,000+ objects
• 62,000,000 files
• 7 content types: books, images, audio, video, manuscripts, GIS data, software
• Integrated component of larger environment
2008: SDR 1.0 In Production & Working, BUT…
• Custom code, maintained by evolving & smaller team
– No reuse of code within Stanford, or larger community
• Bottlenecks
– Needed to be quicker to add new content types
– Needed to be quicker to add new collections
– Needed to decompose code into more granular components
• Largely a stand-alone system
– Lacked flexible Management services for streamlined, continuous content deposit workflows
– “Dark Archive”: no access services for rich, self-service patron access
SDR 1.0 Architecture: Strongly Rooted in OAIS
SDR 2.0: New Technical Architecture
• Adopt Fedora as a metadata management system
– Clean mapping of new data model to Fedora content models
– Reuse same design pattern, core technology as in DOR
• Support for parallelized & asynchronous operations
– Multiple ingest streams to increase throughput
– Decompose one process (e.g., “ingest”) into discrete, loosely coupled operations (“checksum”, “package”, “transfer”)
• Adopt a RESTful architecture & common workflow service
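The decomposition of “ingest” into “checksum”, “package” and “transfer”, run over multiple ingest streams, might be sketched as below. This is a minimal illustration only, not SDR code: the object dictionaries, the in-memory `store`, the `.pkg` naming and the `streams` parameter are all assumptions made for the example.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def checksum(obj):
    """Step 1: record a SHA-256 digest of the object's payload."""
    return {**obj, "sha256": hashlib.sha256(obj["payload"]).hexdigest()}


def package(obj):
    """Step 2: attach a package name (hypothetical naming scheme)."""
    return {**obj, "package": f"{obj['id']}.pkg"}


def transfer(obj, store):
    """Step 3: hand the packaged object to an in-memory stand-in for storage."""
    store[obj["id"]] = obj
    return obj["id"]


def ingest(obj, store):
    """Compose the three independent steps for a single object."""
    return transfer(package(checksum(obj)), store)


def ingest_many(objects, store, streams=4):
    """Run several ingest streams concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=streams) as pool:
        return list(pool.map(lambda o: ingest(o, store), objects))
```

Because each step only takes an object and returns an object, any one of them can be replaced or rerun without touching the others, which is the point of the loose coupling.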
SDR 2.0: New Technical Architecture
SDR 2.0: Robots & “WorkDo” Service
Complex Systems from Atomic Pieces
SDR 2.0: Revised Data Model
SDR 1.x’s METS-based SIP, AIP and DIP had many issues:
– Each Transfer Manifest was content & collection specific: doesn’t scale
– Transfer manifests require too much interpretation and analysis to change, augment
– Too complex: Stanford METS structure breaks apart related data across the object
– Wraps (somewhat dynamic) metadata with (mostly static) data files in same envelope
– Recursive nature of transfer manifest makes versioning self-referential, complex
– No one speaks METS natively: depositors, SDR & clients all forced to perform translation at handshakes
Content Structures and Flavors of Metadata
• Flexible data model can take any type of data, packaged in “bags”
– A “bag” is a directory with standardized top-level structure and syntax
• Minimizes analysis & processing required on ingest
• Preserves options for future processing & transformations based on future needs
Each object has seven discrete metadata files:
– Identity metadata
– Descriptive metadata
– Content metadata (aka structural metadata)
– Technical metadata
– Rights metadata
– Source metadata
– Provenance metadata
SDR Deposits: Content Transfer via BagIt
druid/bagit-info.txt
    :
    Stanford-Content-Metadata: data/metadata/contentMetadata
    Stanford-Identity-Metadata: data/metadata/identityMetadata
    Stanford-Provenance-Metadata: data/metadata/provenanceMetadata
/data
    /metadata
        /contentMetadata
        /descMetadata
        /identityMetadata
        /provenanceMetadata
        /rightsMetadata
        /sourceMetadata
        /technicalMetadata
    /content
        /file1
        /file2
        :
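A rough sketch of how a bag with this data/metadata and data/content layout could be assembled is shown below. It is an illustration under assumptions, not the SDR deposit tooling: `build_bag` is a hypothetical helper, the `BagIt-Version` string and the choice of a SHA-256 manifest are assumptions, and the Stanford-specific tag file from the slide is not written.

```python
import hashlib
from pathlib import Path

# The seven metadata file names listed on the slide.
METADATA_FILES = [
    "contentMetadata", "descMetadata", "identityMetadata",
    "provenanceMetadata", "rightsMetadata", "sourceMetadata",
    "technicalMetadata",
]


def build_bag(root, druid, content, metadata):
    """Write data/content and data/metadata under root/druid, plus a manifest.

    content and metadata map file names to bytes. Returns the bag directory.
    """
    bag = Path(root) / druid
    for subdir, files in (("content", content), ("metadata", metadata)):
        target = bag / "data" / subdir
        target.mkdir(parents=True, exist_ok=True)
        for name, payload in files.items():
            (target / name).write_bytes(payload)

    # Bag declaration file (the version string here is an assumption).
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<digest>  <relative path>" line per file in data/.
    lines = []
    for path in sorted((bag / "data").rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return bag
```

The receiving side only needs to walk the manifest and recompute digests, which is why a bag keeps the analysis required on ingest small.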
Lessons Learned Over 5 Years
• Custom code, maintained by evolving & smaller team, was inefficient & unsustainable
– Adopted Fedora for metadata management, Hydra for application framework
– Shared technology & design patterns with rest of digital library ecosystem
– API’s for management, ingest, retrieval, reporting
• Bottlenecks
– Need to be quicker to add new content types & collections: simplify the data model, support “Zip & SIP”
– Need to increase the throughput to the storage layer led to parallelization of processes
• Need to refine & hone the SDR service model
– Complement Preservation with robust Management & Access services
Preservation Is One Leg of a Stool
• Preservation without Access is pointless
– Further, all signs indicate that it is not economically viable
• Access without Preservation is myopic
• Robust Management services are prerequisite for accessioning, archiving and providing access to content
– The “pre-ingest” phenomenon
Can one system handle it all? or
Stanford’s Digital Library Ecosystem
Three Spheres: Management, Preservation and Access
Digitization, Deposit & Management
Preservation
Discovery & Delivery
Stanford Digital Repository (SDR): content agnostic, preservation repository
Specialty applications provide context-specific, user-facing deposit, and access services tailored to content types and disciplines
SDR in Stanford’s DL Ecosystem
Library Management Applications
EEMS (acquiring born digital content), digitization workflow, etc.
Institutional Repository
ETDs, open access articles, faculty “papers”, research data, web sites, etc.
SULAIR Digital Stacks
Delivery for text, images, mss, media, data, & curated collections
National Geospatial Digital Archive (NGDA)
Geospatial data
and SDR provides “back-office” preservation services: replication, auditing, migration, and retrieval in a secure, sustainable, scalable stewardship environment
E.g., Parker Manuscripts
•559 Anglo-Saxon manuscripts, 200,000 pages
•For each page:
– 22 MB JPEG2000 delivery surrogate
– 110 MB submaster TIFF
– 220 MB master TIFF
SDR – Preservation Core
Parker.stanford.edu: Rich web application, tailored for general public, medievalists
Separation of Concerns
• Scoped repository: differentiation between preservation (provided by SDR) and
– …content management (provided by DOR)
– …access (provided by the Digital Stacks apps)
• Implications:
– Reduces pressure on SDR to be all things to all depositors, for all content
– Reinforces need to provide managed & secure storage at scale
– Reinforces requirement to focus on fixity and integrity services
– Emphasizes need to integrate SDR to management & access services through stable API’s
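As an illustration of what a fixity service checks, the sketch below compares stored files against digests recorded at ingest. It is a simplified example under assumptions: the `audit` function, the in-memory dictionaries and the use of SHA-256 are hypothetical and do not describe SDR's actual services.

```python
import hashlib


def digest(data):
    """Return the SHA-256 hex digest of a bytes payload."""
    return hashlib.sha256(data).hexdigest()


def audit(stored, expected):
    """Compare stored payloads against the digests recorded at ingest.

    stored:   dict of file name -> bytes currently held in storage
    expected: dict of file name -> digest recorded at ingest
    Returns a dict of file name -> "ok", "corrupt", or "missing".
    """
    report = {}
    for name, want in expected.items():
        if name not in stored:
            report[name] = "missing"
        elif digest(stored[name]) == want:
            report[name] = "ok"
        else:
            report[name] = "corrupt"
    return report
```

Run on a schedule, a report like this is what would trigger retrieval of a good replica for any file flagged as corrupt or missing.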
Management: Hydra-based Applications
Under Development…
• SDR’s Front End – Institutional Repository for Stanford
• Hypatia – Archival Arrangement, Description & Access
• SDR Preservation Core Administrative Application
ETD’s – Electronic Theses & Dissertations
SALT – Self-Archiving Legacy Toolkit
EEMs – Everyday Electronic Materials
Hydra
• Joint development project among Stanford, University of Virginia, University of Hull and Fedora Commons
• Based on Fedora, ActiveFedora and Ruby on Rails
• Reuse Blacklight & Solr for search & browse within a Hydra application
Fundamental Assumption #1
No single system can provide the full range of repository-based solutions for a given institution’s needs,
…yet sustainable solutions require a common repository infrastructure.
For instance…
An ETD solution…
- Single PDF
- With auxiliary data files
- Simple, prescribed workflow
- Integrated with student administration system
- Streamlined UI for depositors, reviewers & readers
A digitization workflow system…
- Potentially hundreds of file types per object
- Complex, branching workflow
- Sophisticated operator (back office) interfaces
A general purpose institutional repository
- Heterogeneous file types
- Simple to complex objects
- General purpose user interfaces
Distinct Application Needs
More than one dozen distinct repository application needs across three institutions.
• Electronic theses & dissertations
• Open access articles
• Data curation application(s)
• General purpose institutional repository
• Manuscript & archival collection delivery
• Library materials accessioning tools
• Digitization workflow system
• And more...
Shared, Primitive Functions
• Deposit – uploading simple or multipart objects, singly or in bulk
• Manage – editing an object’s content, metadata and permissions
• Search – full text and fielded search supporting both user discovery and administration
• Browse – sequential viewing of objects by collection, attribute or ad hoc filtering
• Deliver – viewing, downloading & disseminating objects through user and machine interfaces
Hydra Philosophy -- Technical
• Tailored applications and workflows for different content types, contexts and user interactions
• A common repository infrastructure
• Flexible, atomistic data models
• Modular, “Lego brick” services
• Library of user interaction widgets
• Easily skinned UI
One body, many heads
Fundamental Assumption #2
No single institution can resource the development of a full range of solutions on its own,
…yet each needs the flexibility to tailor solutions to local demands and workflows.
Hydra Philosophy -- Community
• An open architecture, with many contributors to a common core
• Collaboratively built “solution bundles” that can be adapted and modified to suit local needs
• A community of developers and adopters extending and enhancing the core
• “If you want to go fast, go alone. If you want to go far, go together.”
One body, many heads
Electronic Theses and Dissertations
• Automatic deposit to library as part of degree conferral
• Built in digital collection building
• Better access for patrons
• Reduced expenses for students, University, library processing
• Increased visibility of and access to Stanford research via catalog & Google
• Built in preservation through Stanford Digital Repository
Electronic Theses & Dissertations (ETD)
EEMs: Accessioning Born Digital Materials
Browser widget enables selector to capture the PDF, plus URL, title, author, copyright status, payment information, and comments, and route to Acquisitions.
EEMs: Accessioning Born Digital Materials
Dashboard enables item processing, ultimately leading to preservation in SDR and access via the catalog.
SALT: Digital Archives
• Archiving unstructured and semi-structured data
• Allow access to semi-processed information,
- with strong access & visibility controls
- leveraging full text & entity extraction
• Ongoing enrichment of the archive
- through self-annotation by the donor
- through crowd-sourcing description and organization
Component Based Architecture
• Fedora as a metadata store
• Well structured file system as data store
• Solr index for rapid data access
• Blacklight & Hydra: app logic & presentation
• Atomic Services
– “Robots”: simple, autonomous scripts, providing small units of work in reusable packages
– “Services” provide common operations that support workflows across the environment
• “WorkDo”: lightweight workflow to orchestrate cascade of services
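The robot and workflow pattern described above could look roughly like the sketch below. It is an illustration only: `WorkflowService` and `Robot` are hypothetical stand-ins, not the WorkDo API, and the step names are borrowed from the “ingest” decomposition mentioned earlier in the deck.

```python
WORKFLOW = ["checksum", "package", "transfer"]  # ordered steps for this example


class WorkflowService:
    """Hypothetical workflow service: tracks the status of each step per object."""

    def __init__(self, steps):
        self.steps = steps
        self.status = {}  # object id -> {step: "waiting" | "completed"}

    def register(self, obj_id):
        self.status[obj_id] = {step: "waiting" for step in self.steps}

    def ready_for(self, step):
        """Objects waiting on `step` whose earlier steps are all completed."""
        earlier = self.steps[: self.steps.index(step)]
        return [
            obj_id
            for obj_id, state in self.status.items()
            if state[step] == "waiting"
            and all(state[e] == "completed" for e in earlier)
        ]

    def complete(self, obj_id, step):
        self.status[obj_id][step] = "completed"


class Robot:
    """A robot performs one small unit of work for one step, then reports back."""

    def __init__(self, step, action):
        self.step = step
        self.action = action

    def run_once(self, service):
        done = []
        for obj_id in service.ready_for(self.step):
            self.action(obj_id)
            service.complete(obj_id, self.step)
            done.append(obj_id)
        return done
```

Each robot knows only its own step and asks the service what is ready, so robots can be added, removed or scaled out without any of them knowing about the others.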
DOR & Digital Stacks Architecture
Digital Library Ecosystem
Growth in Disk and Computing at SULAIR
Stanford’s Digital Library, 2011
The next generation of digital libraries will be complex ecosystems made up of simple components.
Separate systems for digitization, management, preservation and access will enable pieces to be mixed and matched, supporting content streams from a variety of sources, and access by a variety of communities, services and tools.
Photo by Alun Salt. Used under CC Attribution-ShareAlike 2.0 Generic license.
LOCKSS
• Lots of Copies Keeps Stuff Safe
• Originated at Stanford University
• Peer-to-peer, decentralized digital preservation system
• Focus is on scholarly articles
– 7100 e-journal titles, 470 publishers
– Collects web-based content
– Preserves it locally
– Provides 100% post-cancellation access
– Done with publisher permission
LOCKSS
Capture & Replication
LOCKSS
Audit & Healing
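The idea of auditing copies against one another and healing the damaged ones can be illustrated with a toy majority comparison. This is not the LOCKSS polling protocol, which is considerably more elaborate; `audit_and_heal` and the peer names are assumptions made purely for illustration.

```python
import hashlib
from collections import Counter


def audit_and_heal(copies):
    """Repair copies that disagree with the majority.

    copies: dict of peer name -> bytes held by that peer (modified in place).
    Returns the list of peers whose copy was replaced.
    """
    digests = {peer: hashlib.sha256(data).hexdigest() for peer, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise ValueError("no majority; cannot decide which copy is good")

    # Take any copy matching the winning digest as the repair source.
    good = next(data for peer, data in copies.items() if digests[peer] == winner)
    healed = [peer for peer in copies if digests[peer] != winner]
    for peer in healed:
        copies[peer] = good
    return healed
```

With four peers and one damaged copy, the three agreeing copies outvote the damaged one and it is overwritten; with no majority the function refuses to guess.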
LOCKSS
• Commodity Hardware & Open source software & Appliance = very low cost
• Follows traditional model of library-based distribution and preservation
– Lots of Copies
– Locally Managed Copies
• Publisher permissions ensure legal coverage
• Extensible to other collections
LOCKSS
• CLOCKSS: Controlled LOCKSS
– Not-for-profit archive for ensuring access to orphaned scholarly content
– One dozen major publishers + libraries
• Private LOCKSS Networks
– Alabama Digital Preservation Network
– Arizona State Library, Archive & Public Records
– Council of Prairie & Pacific University Libraries Consortium
– Data Preservation Alliance for the Social Sciences
– Digital Commons – Berkeley Electronic Press
– MetaArchive Cooperative Project
– Digital Federal Depository Library Program