Harvard’s Digital Repository Service (DRS) Architecture
Harvard University Library (HUL)
Andrea Goethals, Randy Stern
December 10, 2009
Today’s Agenda
1. What is the DRS?
2. DRS 1 Architecture
3. DRS 2 Highlights
4. Questions
1. What is the DRS?
DRS Context
• A core portion of HUL’s mission is to provide current and future access to research materials and resources, with recognition that preserving access to digital content requires different strategies, tools and skills
• Digital Preservation projects and activities (2000-)
• Digital Preservation Program (June 2008-)
  • Centerpiece: the Digital Repository Service (DRS)
What is the DRS?
Set of professionally managed services for preservation and access
[Diagram: DRS services; labels include creation/acquisition and use]
• Creation & format guidelines, training, ingest service
• Metadata and content storage & monitoring service
• Delivery services, access restrictions, persistent names
• Preservation planning & activities, administration, management tools
What’s in the DRS?
DRS by the numbers
• 103 TB of content
  • 335 TB total (counting all copies)
• 13 M files
  • 10 M image files
  • 21,000 audio files
  • 2.8 M text files
  • 851,000 compressed Google books containing 672 M files
  • 6,300 compressed web harvests containing 14 M web files
DRS growth
• Fueled by large projects
• Recent explosion – mass digitization (Google book project)
[Chart: DRS size in TB (y-axis 0 to 120), Oct-00 through Oct-09]
Broadening content and metadata requirements
• New formats and genres, born-digital content
  • Email archiving, more audio, drawing, video
• Descriptive metadata, linkages to catalogs
• Rights management, more access restrictions
• Auxiliary content
  • Contextual material, licenses, donor agreements, collection objects, documentation, repository agents
2. DRS 1 Architecture
DRS System Architecture
[Diagram: core DRS components, connected over TCP/IP and NFS]
• DRS Web Admin Tools
• Ingest Services
• Delivery Services
• Metadata Storage Database
• Consistency Validation Service
• Content Storage Service
Metadata Storage Database
DRS-1 objects are modeled as related files
• File metadata:
  • Administrative (owners, projects, deposit dates, owner IDs, etc.)
  • Technical (format MIME type & format-specific data)
  • Role, purpose, quality
  • No descriptive metadata
  • Access restrictions (public, Harvard-only, dark)
  • MD5 file digest and byte count
• Relationship triples
  • “is_part_of”, “is_preservation_replacement_for”, etc.
  • 21 relationship types
  • ~13M files, 12.3M relationships
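To make the "related files" model concrete, here is a minimal sketch of file records linked by relationship triples. The class names, field names, and the `parts_of` helper are hypothetical and chosen for illustration; only the metadata categories, the access levels, and the relationship names come from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """The three access restriction levels named on the slide."""
    PUBLIC = "public"
    HARVARD_ONLY = "Harvard-only"
    DARK = "dark"


@dataclass(frozen=True)
class FileRecord:
    """Per-file metadata (hypothetical field names)."""
    file_id: int      # hypothetical identifier
    owner: str        # administrative metadata
    mime_type: str    # technical metadata
    role: str         # role, purpose, quality
    access: Access    # access restriction
    md5: str          # MD5 file digest
    byte_count: int


@dataclass(frozen=True)
class Relationship:
    """One triple: subject file, relationship type, object file."""
    subject_id: int
    predicate: str    # e.g. "is_part_of"
    object_id: int


def parts_of(object_id, relationships):
    """Return the IDs of files that declare is_part_of the given file."""
    return sorted(
        r.subject_id
        for r in relationships
        if r.predicate == "is_part_of" and r.object_id == object_id
    )


if __name__ == "__main__":
    triples = [
        Relationship(2, "is_part_of", 1),
        Relationship(3, "is_part_of", 1),
        Relationship(4, "is_preservation_replacement_for", 2),
    ]
    print(parts_of(1, triples))
```

In this sketch an "object" is nothing more than the set of files reachable through triples, which is the sense in which DRS-1 objects are modeled as related files.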
Content Storage Service
Bit preservation
• Redundancy, heterogeneity, extensibility, scalability, simple file access protocol
• Access demands high availability and high performance delivery
• Functional requirements:
  • At least three copies in three physical locations
  • Two media types
  • Two on-line copies for high availability
  • One near-line copy, one off-line copy
Content Storage Service
Storage provider
• SUN SAM/QFS Storage Archive Manager
  • 2 file classes: highuse and lowuse
  • Archiving rules
• High use files
  • Copy 1 on disk at local server center
  • Copy 2 on disk at remote server center
  • Copy 3 on tape in library
  • Copy 4 on tape off line at Harvard Depository
• Low use files
  • Copy 1 on disk at remote server center
  • Copy 2 on tape in library
  • Copy 3 on tape off line at Harvard Depository
• High speed cache for access
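The archiving rules above can be written down as data and summarized per file class. This is an illustrative sketch only, not SAM/QFS configuration: the `Copy` tuple, the `summarize` helper, and the availability labels are assumptions (disk copies are treated as on-line and the library tape copy as near-line; the slide itself only labels the Depository copy as off line).

```python
from collections import namedtuple

# One archive copy: copy number, medium, assumed availability, location as worded on the slide.
Copy = namedtuple("Copy", "number medium availability location")

ARCHIVE_RULES = {
    "highuse": [
        Copy(1, "disk", "on-line", "local server center"),
        Copy(2, "disk", "on-line", "remote server center"),
        Copy(3, "tape", "near-line", "library"),
        Copy(4, "tape", "off-line", "Harvard Depository"),
    ],
    "lowuse": [
        Copy(1, "disk", "on-line", "remote server center"),
        Copy(2, "tape", "near-line", "library"),
        Copy(3, "tape", "off-line", "Harvard Depository"),
    ],
}


def summarize(file_class):
    """Count copies, distinct media types, and on-line copies for a file class."""
    copies = ARCHIVE_RULES[file_class]
    return {
        "copies": len(copies),
        "media_types": len({c.medium for c in copies}),
        "on_line": sum(1 for c in copies if c.availability == "on-line"),
    }


if __name__ == "__main__":
    for name in ARCHIVE_RULES:
        print(name, summarize(name))
```

Under these assumptions both classes keep at least three copies on two media types, while only the high use class keeps two on-line copies.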
Consistency Validation Service
• Continuous monitoring for file system and database consistency
  • Crawls the file system and confirms that every disk file has a DRS metadata record
  • Crawls the DRS metadata records table and confirms that every file referenced exists in the file system
  • Confirms that the MD5 checksum for each file is the same as recorded in the database
  • Reports errors to administrators
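The three checks above can be sketched in a few lines. This is a simplified illustration, not the DRS service: the metadata table is stood in for by a plain dictionary mapping relative paths to MD5 digests, and the function names and error labels are hypothetical.

```python
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(root, records):
    """Compare files under root with records (relative path -> expected MD5).

    Returns a sorted list of (error_kind, relative_path) tuples.
    """
    errors = []
    on_disk = set()

    # Check 1: every file on disk has a metadata record.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            on_disk.add(rel)
            if rel not in records:
                errors.append(("no_metadata_record", rel))

    # Checks 2 and 3: every record points at an existing file with a matching MD5.
    for rel, expected in records.items():
        if rel not in on_disk:
            errors.append(("missing_file", rel))
        elif md5_of(os.path.join(root, rel)) != expected:
            errors.append(("checksum_mismatch", rel))

    return sorted(errors)
```

A real service would run continuously and report to administrators; here the caller simply receives the list of discrepancies.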
Delivery and Access Services
• Real time web delivery
  • Image delivery service
    • JPEG, JPEG 2000, TIF, GIF
  • Page turned object delivery service
    • METS + page images + page text
  • Streaming delivery service
    • Real Audio
  • File delivery service
    • PDFs
  • Web Archiving Service
• Asynchronous delivery service
  • Archival masters
Administrative Services
• DRS Web Administrator
  • Searching, reporting, file operations, archival master download
• Page Turned Object Maintenance
  • METS structure editor
• Name Resolution Service Maintenance
  • URN create/update/report
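The URN create/update/report operations can be pictured as maintenance of a name-to-location table. The sketch below is a toy in-memory version under that assumption; the `NameResolver` class, its method names, and the example URN and URL are hypothetical and say nothing about how the Name Resolution Service is actually built or what its URNs look like.

```python
class NameResolver:
    """Toy registry mapping persistent names (URNs) to current URLs."""

    def __init__(self):
        self._targets = {}

    def create(self, urn, url):
        """Register a new URN; refuse to overwrite an existing one."""
        if urn in self._targets:
            raise ValueError("URN already exists: " + urn)
        self._targets[urn] = url

    def update(self, urn, url):
        """Point an existing URN at a new URL; the name itself stays stable."""
        if urn not in self._targets:
            raise KeyError(urn)
        self._targets[urn] = url

    def resolve(self, urn):
        """Return the URL currently registered for a URN."""
        return self._targets[urn]

    def report(self):
        """List all (urn, url) pairs in a stable order."""
        return sorted(self._targets.items())


if __name__ == "__main__":
    resolver = NameResolver()
    resolver.create("urn:example:object-1", "https://example.org/v1/object-1")
    resolver.update("urn:example:object-1", "https://example.org/v2/object-1")
    print(resolver.resolve("urn:example:object-1"))
```

The point of the indirection is that citations keep the URN while the delivery URL behind it can change.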
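DRS System Architecture
Ingest Services
[Diagram: the system architecture with the ingest components shown]
• Ingest components: Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service
• Core components (connected over TCP/IP and NFS): DRS Web Admin Tools, Delivery Services, Metadata Storage Database, Consistency Validation Service, Content Storage Service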
DRS System Architecture
Delivery Services
[Diagram: the system architecture with the delivery components shown]
• Delivery components: Load Balanced Delivery Services; Catalogs – Web Sites – Google
• Ingest components: Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service
• Core components (connected over TCP/IP and NFS): DRS Web Admin Tools, Metadata Storage Database, Consistency Validation Service, Content Storage Service
DRS System Architecture
Persistent Naming and Access Services
[Diagram: the system architecture with the naming and access components shown]
• Naming and access components: Access Management Service, Name Resolution Service
• Delivery components: Load Balanced Delivery Services; Catalogs – Web Sites – Google
• Ingest components: Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service
• Core components (connected over TCP/IP and NFS): DRS Web Admin Tools, Metadata Storage Database, Consistency Validation Service, Content Storage Service
DRS System Architecture
Storage Services
[Diagram: the system architecture with the SAM/QFS storage archives and sites shown]
• Sites: Site 1 Cambridge, Site 2 Boston, Site 3 Westborough
• Disk archive (High use, copy 1)
• Disk archive (High use, copy 2); Disk archive (Low use, copy 1)
• Tape archive (High use, copy 3); Tape archive (Low use, copy 2)
• Media only: Tape archive (High use, copy 4); Tape archive (Low use, copy 3)
• Other components (connected over TCP/IP and NFS): Depositors, BatchBuilder, SFTP Drop Boxes, DRS Loader, Web Archiving Service, Load Balanced Delivery Services, Catalogs – Web Sites – Google, Access Management Service, Name Resolution Service, DRS Web Admin Tools, Metadata Storage Database, Consistency Validation Service
Storage Services
Implementation
• Sun SAM-QFS 4.6
  • Rule-based automatic archiving – no “backups”
  • Unified file name space
• Dual Sun T2000 Solaris SAM servers
  • Redundant servers at site 1, DR failover at site 2
  • Nightly samfsdump from site 1 - samfsrestore at site 2
• EMC CLARiiON disk storage arrays
  • RAID 1+0 FC cache / RAID 5 SATA disk archives
  • 35 TB CX3-40 at site 1, 109 TB CX3-80 at site 2
• StorageTek SL500 tape library
  • LTO-4
• In production since Feb 2008
Storage Services
Redundancy
[Diagram: redundant storage hardware linked by private and public TCP/IP networks]
• Three Sun T2000 servers (Solaris 10, SAM-QFS), attached through FC switches
• EMC CX3-40 (FC / SATA, RAID 1+0 / RAID 5; two SPs, 4 GB cache each): staging cache; Disk archive (High use, copy 1)
• EMC CX3-80 (FC / SATA, RAID 1+0 / RAID 5; two SPs, 8 GB cache each): Disk archive (High use, copy 2); Disk archive (Low use, copy 1)
• StorageTek SL 500 (LTO-4; robot and four drives): Tape archive (High use, copy 3); Tape archive (Low use, copy 2)
• Media only (LTO-4): Tape archive (High use, copy 4); Tape archive (Low use, copy 3)
• Location labels: On-site, UIS; Off-site, HBSP; Off-site, HD
• Access path labels: SAM, NFS, App server, Web server, HTTP
Metadata Storage Service
Implementation
• DRS metadata storage
  • Oracle 10g
  • Live production server – copy 1
  • Data Guard failover copy – copy 2
  • Legato tape backups – copy 3
Ingest Services
Implementation
• Batch deposit of SIPs to SFTP drop boxes
• DRS Batch Loader operates 8AM-8PM
• 51 object owners – libraries, museums
• ~12 depositors
• 234 project codes
• Daily weekday deposits average ~60 GB/day
Delivery Services
Implementation
• High availability design
  • Redundant public access servers
    • Delivery, access management, name resolution
  • Cisco Content Switch
    • Load balancing, sticky sessions
  • MRTG monitoring
• Change control – no downtime on updates
• Red Hat Enterprise Linux, Java 1.5, Tomcat
• Tomcat and log4j logging and statistics
3. DRS 2 Highlights
Scope of work
• Builds on the early 2008 storage upgrade
• 2008-~2013
• Affects every part of the DRS!
  • Expanded data model
  • New and different metadata
  • Object descriptors
  • Content models
  • Preservation plans
  • Enhanced deposit tools
  • New management applications
  • New backend services
• First major release: Summer 2011
Object descriptors
• A METS metadata file per object on the file system alongside content files
• Descriptive, administrative, preservation, technical and structural metadata
• Describes the object, all its files and bitstreams and related significant events
• Gives the metadata the same secure storage as the content files
• Self-contained, portable objects
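As a rough illustration of a per-object descriptor, the sketch below builds a skeletal METS document listing an object's files and a flat structural map. It is not the DRS 2 descriptor profile and is not validated against the METS schema; the `build_descriptor` function, the input dictionary keys, and the `FILE1`-style IDs are assumptions, and the descriptive, preservation, and event metadata the slide mentions are omitted.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def _tag(name):
    """Qualify an element name with the METS namespace."""
    return "{%s}%s" % (METS_NS, name)


def build_descriptor(object_id, files):
    """Build a skeletal METS descriptor.

    files is a list of dicts with the hypothetical keys
    name, mime_type, size, and md5.
    """
    mets = ET.Element(_tag("mets"), {"OBJID": object_id})
    file_grp = ET.SubElement(ET.SubElement(mets, _tag("fileSec")), _tag("fileGrp"))
    div = ET.SubElement(ET.SubElement(mets, _tag("structMap")), _tag("div"))

    for index, info in enumerate(files, start=1):
        file_id = "FILE%d" % index
        file_el = ET.SubElement(file_grp, _tag("file"), {
            "ID": file_id,
            "MIMETYPE": info["mime_type"],
            "SIZE": str(info["size"]),
            "CHECKSUM": info["md5"],
            "CHECKSUMTYPE": "MD5",
        })
        # The content file sits alongside the descriptor, so a relative name is used.
        ET.SubElement(file_el, _tag("FLocat"), {
            "LOCTYPE": "URL",
            "{%s}href" % XLINK_NS: info["name"],
        })
        # The structural map points back at each file entry.
        ET.SubElement(div, _tag("fptr"), {"FILEID": file_id})

    return ET.tostring(mets, encoding="unicode")
```

Writing the returned string next to the content files is what would make the object self-contained in the sense of the slide: the checksums and file inventory travel with the files rather than living only in the database.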
Some technical challenges
• Amount of metadata to store
  • Bitstream description
  • Many elements (esp. MODS, MIX)
• Efficient, scalable search implementation
  • Database, index, combination?
• Keeping metadata in sync
  • Database, object descriptors on file system
• Effect on system of continued growth
  • Consistency checks, migrations, format analysis, etc.
• HRCI requirements
• Email archiving
4. Questions?
andrea_goethals@harvard.edu
randy_stern@harvard.edu