Northwestern digital repository initiative: platform and persistence
Claire Stewart
Director, Center for Scholarly Communication and Digital Curation
Head, Digital Collections, Library Technology Division
Northwestern [email protected]
What is a repository and why should I care?
Library as institutional memory
Tweeted in 2012 by Gail Steinhart, Head of Research Services, Mann Library, Cornell University
Vines, T. H., Albert, A. Y. K., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., … Rennison, D. J. (2013). The Availability of Research Data Declines Rapidly with Article Age. Current Biology, 24(1), 94–97. doi:10.1016/j.cub.2013.11.014
“The major cause of the reduced data availability for older
papers was the rapid increase in the proportion of data sets
reported as either lost or on inaccessible storage media. For
papers where authors reported the status of their data, the
odds of the data being extant decreased by 17% per year
(Figure 1D).” [emphasis added]
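To make the quoted rate concrete, the following is a minimal sketch of how a 17% yearly decline in odds compounds over time. It assumes the decline acts as a constant multiplicative factor (0.83 per year); the starting odds of 4:1 and the function names are hypothetical values chosen for illustration, not figures from the paper.

```python
def odds_after(years: int, initial_odds: float, yearly_decline: float = 0.17) -> float:
    """Odds remaining after `years`, assuming a constant multiplicative decline."""
    return initial_odds * (1.0 - yearly_decline) ** years


def odds_to_probability(odds: float) -> float:
    """Convert odds (p / (1 - p)) back to a probability p."""
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical starting point: odds of 4:1 (probability 0.8) at year 0.
    for years in (0, 5, 10, 20):
        odds = odds_after(years, initial_odds=4.0)
        prob = odds_to_probability(odds)
        print(f"year {years:2d}: odds {odds:.3f}, probability {prob:.3f}")
```

Under this assumption the odds fall to less than half of their starting value within four years, since 0.83 to the fourth power is roughly 0.47.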
What is a repository and why should I care?
A concept
The Repository
All the stuff
A set of technologies
Technologies and architecture
Repository as service
• Description and characterization: descriptive, provenance and technical metadata
• Selection, conversion, digitization
• Deposit and versioning
• Interoperability, APIs for ingest, discovery
• Access control, copyright support and other legal/regulatory compliance
• Persistence
  – Stable, permanent links (URLs, DOIs, etc.)
  – Health of digital objects
  – Replication and dark archiving
  – Migration or emulation, virtualization
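One common way to monitor the "health of digital objects" is a fixity check: recompute a file's checksum and compare it with the value recorded at deposit. The sketch below illustrates the idea with SHA-256. The function names are hypothetical, and the choice of algorithm is an assumption; the slides do not say which checksums the repository uses.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(path: Path, expected_sha256: str) -> bool:
    """True if the file still matches the digest recorded at deposit time."""
    return sha256_of(path) == expected_sha256.lower()


if __name__ == "__main__":
    # Demonstration on a throwaway file; a real check would read stored digests.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "page-image.bin"
        sample.write_bytes(b"example content")
        recorded = sha256_of(sample)            # digest stored at deposit
        print("intact:", check_fixity(sample, recorded))
        sample.write_bytes(b"example c0ntent")  # simulate silent corruption
        print("intact after change:", check_fixity(sample, recorded))
```

A recorded digest of this kind is the sort of value that preservation metadata such as PREMIS is designed to carry.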
What’s already in our repository
digital.library.northwestern.edu
Maps of Africa
First Fedora project @ NU
2006 project, internally funded
116 antique maps at high resolution
Maps in Fedora
METS, PREMIS, JPEG2000
Archival finding aids
findingaids.library.northwestern.edu
Archon for EAD, Fedora + Blacklight for storage and discovery, Primo syndication
Winterton Collection
Northwestern Books and the Book Workflow Interface
2009
Mellon-funded
Now used for all in-house book digitization
books.northwestern.edu
Every page of each digitized book has this information:

Information                                 Datastream ID   MIMETYPE          Schema/ontology
Dublin Core metadata                        DC              text/xml          OAI_DC
MODS metadata                               MODS            text/xml          MODS
Relationship metadata                       RELS-EXT        text/xml          RELS-EXT
OCR PDF file                                PDF             application/pdf
OCR XML                                     OCR XML         text/xml          ABBYY OCR
OCR Text                                    OCR TEXT        text/plain
Source camera image file                    ARCHV-IMG       image/jpeg
Source technical metadata in MIX            ARCHIV-TECHMD   text/xml          MIX
Source camera technical metadata in EXIF    ARCHV-EXIF      text/xml          Exif as XML
Corrected image file                        PROC-IMG        image/jpeg
Corrected image technical metadata in MIX   PROC-TECHMD     text/xml          MIX
Delivery image JPEG2000 file                DELIV-IMG       image/jp2
Delivery image technical metadata in MIX    DELIV-TECHMD    text/xml          MIX
SVG for delivery mechanism                  DELIV-OPS       text/xml          SVG
Viewer html                                 HTML            text/html         HTML
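The per-page datastream list above can be treated as a checklist. The following sketch shows one way to audit a page object against it. The datastream IDs and MIME types are copied from the slide as written; the `audit_page` function and the plain-dictionary representation of a page are hypothetical and do not reflect the Fedora API or the Book Workflow Interface.

```python
# Datastream ID -> MIME type, as listed on the slide.
EXPECTED_DATASTREAMS = {
    "DC": "text/xml",
    "MODS": "text/xml",
    "RELS-EXT": "text/xml",
    "PDF": "application/pdf",
    "OCR XML": "text/xml",
    "OCR TEXT": "text/plain",
    "ARCHV-IMG": "image/jpeg",
    "ARCHIV-TECHMD": "text/xml",
    "ARCHV-EXIF": "text/xml",
    "PROC-IMG": "image/jpeg",
    "PROC-TECHMD": "text/xml",
    "DELIV-IMG": "image/jp2",
    "DELIV-TECHMD": "text/xml",
    "DELIV-OPS": "text/xml",
    "HTML": "text/html",
}


def audit_page(datastreams: dict[str, str]) -> list[str]:
    """Compare a page's {datastream ID: MIME type} map with the expected set.

    Returns a list of human-readable problems; an empty list means the page
    carries every expected datastream with the expected MIME type.
    """
    problems = []
    for ds_id, mime in EXPECTED_DATASTREAMS.items():
        if ds_id not in datastreams:
            problems.append(f"missing datastream {ds_id}")
        elif datastreams[ds_id] != mime:
            problems.append(f"{ds_id}: expected {mime}, found {datastreams[ds_id]}")
    return problems


if __name__ == "__main__":
    # Hypothetical page that lacks its OCR text and has a mislabeled delivery image.
    page = dict(EXPECTED_DATASTREAMS)
    del page["OCR TEXT"]
    page["DELIV-IMG"] = "image/jpeg"
    for problem in audit_page(page):
        print(problem)
```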
By the numbers: number of objects
As of November 2013:
• Finding aids: 1,114
• Digitized books: 3,491
• Digitized book pages: 835,806
• Image objects: 216,271
• A few others, including 3D objects, and collection objects
A total of 1,187,414 objects in the repository
Every object has several datastreams (files, descriptive metadata, technical metadata, etc.)
By the numbers: storage
As of Feb 5, 2014:
97.1 TB of content on the repository (including digitized collections queued for ingestion) and the JPEG2000 server.
Library & NUIT purchased 200 TB of storage, replicated between the Evanston and Chicago campuses (that is over 400 TB in total).
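As a rough planning aid, the storage figures above can be turned into a utilization estimate. This sketch assumes the 97.1 TB of content counts against the 200 TB of usable capacity on each campus copy, which the slide does not state explicitly; the function names are hypothetical.

```python
CONTENT_TB = 97.1      # content reported as of Feb 5, 2014
PURCHASED_TB = 200.0   # usable capacity of one copy (assumption)
COPIES = 2             # Evanston and Chicago


def utilization(content_tb: float, capacity_tb: float) -> float:
    """Fraction of usable capacity already occupied by content."""
    return content_tb / capacity_tb


def raw_capacity(capacity_tb: float, copies: int) -> float:
    """Total disk needed to hold every replicated copy."""
    return capacity_tb * copies


if __name__ == "__main__":
    used = utilization(CONTENT_TB, PURCHASED_TB)
    print(f"utilization: {used:.1%}")
    print(f"headroom: {PURCHASED_TB - CONTENT_TB:.1f} TB per copy")
    print(f"raw capacity across campuses: {raw_capacity(PURCHASED_TB, COPIES):.0f} TB")
```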
Digital preservation/persistence
• Persistent URLs
• Mirrored storage (as of fall 2014)
• PREMIS (preservation) metadata
• Routine health checks for data
• Geographically distributed storage
• Dark archives
• Migration/virtualization services
Distributed storage and dark archives
• DuraCloud
• Amazon Glacier
• Digital Preservation Network (DPN)
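Keeping mirrored and offsite copies is only useful if the copies can be shown to agree. One way to verify that is to compare checksum manifests between sites, sketched below. The manifest format (object identifier mapped to a hex digest), the identifiers, and the `compare_manifests` function are all hypothetical; none of the services named above is called here.

```python
def compare_manifests(primary: dict[str, str], replica: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {object ID: checksum} manifests.

    Returns the IDs missing from the replica, the IDs present only on the
    replica, and the IDs whose checksums disagree between the two copies.
    """
    shared = primary.keys() & replica.keys()
    return {
        "missing": sorted(primary.keys() - replica.keys()),
        "extra": sorted(replica.keys() - primary.keys()),
        "mismatched": sorted(k for k in shared if primary[k] != replica[k]),
    }


if __name__ == "__main__":
    # Hypothetical manifests for a primary site and its mirror.
    primary_site = {"obj:1": "aa11", "obj:2": "bb22", "obj:3": "cc33"}
    mirror_site = {"obj:1": "aa11", "obj:2": "ffff", "obj:4": "dd44"}
    report = compare_manifests(primary_site, mirror_site)
    for category, ids in report.items():
        print(f"{category}: {ids}")
```

Reporting the three categories separately is a deliberate choice: a missing object usually points to an incomplete replication job, while a mismatched checksum suggests corruption on one side.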
Current repository projects
• Digital Image Library (DIL)
• Avalon
• Hydramata
Hydra
Northwestern joined in 2011
Framework for repository applications using Ruby on Rails
Community with 22 partners
Digital Image Library (DIL)
2007: Provost-funded move from Art History to the Library, expansion to other disciplines
115,000 images in Hydra + Fedora
Moving all legacy digital collections into DIL & its Hydra counterparts in 2014-2015
images.northwestern.edu
Avalon
IMLS-funded project with Indiana University
Releases:
• 0: July 2012
• .5: October 2012
• 1.0: May 2013
• 2.0: October 2013 (NU pilot)
First NU production deployment with R3, expected within the next month
media.northwestern.edu (dev/demo)
Scholarly communication and digital curation
• Options for archiving scholarly materials
• Authors' rights, copyright help and education, open access support
• E-science and research data life cycle
• Digital humanities
• Library-based publishing
• Responding to funder requirements
Hydramata (formerly Shared IR)
Five-institution project to develop a next-generation institutional repository solution in Hydra
Expanding our repository program
• Massive storage, planning for growth, sustainability
• Digital preservation services
  o Offsite third copy (DPN, DuraCloud, Glacier)
  o Verification services
• Research computing
  o Research data lifecycle: how to capture metadata early? what to keep?
  o Automate deposit from Vault?
• Shared infrastructure and services whenever possible
• Deeper collaboration with NUIT, Research, central admin, schools
Discussion and questions
Claire Stewart
Director, Center for Scholarly Communication and Digital Curation
Head, Digital Collections, Library Technology Division
Northwestern [email protected]