Trusted Datagrids: Library of Congress Projects with UCSD
Ardys Kozbial – UCSD Libraries
David Minor - SDSC
Building Trust in a 3rd Party Repository: A Pilot Project
David Minor San Diego Supercomputer Center
How can the LC trust someone they can’t control?
Moving forward in the right direction requires more than fuzzy promises
… it takes a combination of experts and tools.
Cyberinfrastructure
Cyberinfrastructure is the collection of ...
Resources: computers, data storage, networks, scientific instruments, experts, etc.
+ Glue: integrating software, systems, and organizations
“Effective cyberinfrastructure for the humanities and social sciences will allow scholars to focus their intellectual and scholarly energies on the issues that engage them, and to be effective users of new media and new technologies, rather than having to invent them.”
- ACLS Commission on Cyberinfrastructure for the Humanities & Social Sciences
• “The mission of the San Diego Supercomputer Center (SDSC) is to empower communities in data-oriented research, education, and practice through the innovation and provision of Cyberinfrastructure.”
SDSC ...
• Is one of the original NSF supercomputer centers
• Supports high performance computing systems
• Supports data applications for science, engineering, social sciences, cultural heritage institutions
• Has LARGE data capabilities
– 3+ PB Disk Storage
– 25+ PB Tape Storage
UCSD Libraries
• 3.5+ million volumes
• Digital Asset Management System (in development)
– 250,000+ objects
– 15+ TB
• Shared collections with UC
– California Digital Library
– Digital Preservation Repository
– eScholarship repository
Partnerships and Collaborations
LC Pilot Project – Building Trust in a 3rd Party Repository
– Using test image collections/web crawls, ingest content to SDSC repository
– Allow access for content audit
– Track usage of content over time
– Deliver content back to LC at end of project
Library of Congress NDIIPP Chronopolis Program
– Build Production Capable Chronopolis Grid (50 TB x 3)
– Further define transmission packaging for archival communities
– Investigate best network transfer models for I2 and TeraGrid networks
California Digital Library (CDL) Mass Transit Program
– Enable UC System Libraries to transfer high-speed mass digitization collections across CENIC/I2
– Develop transmission packaging for CDL content
UCSD Libraries’ Digital Asset Management System
– RDF System with data managed in SRB at SDSC
SDSC DPI Group
Digital Preservation Initiatives Group
– Charged with developing and supporting digital preservation services within the Production Systems Division of SDSC
– http://dpi.sdsc.edu
– Cross-Organizational Group
• SDSC Personnel/UCSD Libraries Personnel
– Libraries
– Archives
– Technology
– Information Science
Cyberinfrastructure Trust
For Example:
We worked together to set up high speed data replication services
Checksums
Achieved 200Mb/s
= 2 TB/day
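The two figures above are consistent with each other. A minimal sketch of the conversion, assuming decimal units (1 Mb = 10^6 bits, 1 TB = 10^12 bytes) and a link that sustains the rate for a full 24 hours; the function name is illustrative and not from the project:

```python
def tb_per_day(megabits_per_second: float) -> float:
    """Convert a sustained rate in megabits/s to decimal terabytes per day."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8  # bits -> bytes
    seconds_per_day = 24 * 60 * 60
    return bytes_per_second * seconds_per_day / 1e12


if __name__ == "__main__":
    # 200 Mb/s is 25 MB/s, or about 2.16 TB over 24 hours
    print(round(tb_per_day(200), 2))
```

Real transfers rarely hold the peak rate all day, so the rounded 2 TB/day is a sensible planning number.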
Highly reliable Internet2
Network setup involved …
LC and SDSC staff working together
Configurations on networks and computers
Resolving different security environments
Network monitoring
Lessons Learned
Networking is hard!
Can’t forget it once it’s set up
It’s not magic - there’s always a reason
It highlights collaborative nature of work
Trust Elements
Has a long-term solution been found?
Have multi-institutional issues been solved?
Does new infrastructure improve process?
Is solution useful for other organizations?
SDSC created a robust storage environment for this data
Multiple replications …
… at SDSC
… and geographically diverse locations
(a process with several characteristics)
Needed to replicate structure exactly
This had to be done for 5+ replications
Complex environment had to be transparent
Data had to be available for manipulation
The Storage Resource Broker provided replication services ...
... and extensive monitoring, logging and reporting functions (which led to many conversations)
Logging and monitoring procedures
Scripts which compared the files within the system with a master list – checked changes on either side … fairly straightforward
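A minimal sketch of such a comparison, for illustration only. It is not the project's script: the master list is assumed to be a mapping of relative path to SHA-256 digest, and the names `sha256_of` and `audit` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root: Path, master: dict[str, str]) -> dict[str, list[str]]:
    """Compare files under root with a master list of {relative path: digest}."""
    found = {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in root.rglob("*")
        if p.is_file()
    }
    return {
        "missing": sorted(set(master) - set(found)),      # listed, not on disk
        "unexpected": sorted(set(found) - set(master)),   # on disk, not listed
        "changed": sorted(
            name for name in set(master) & set(found)
            if master[name] != found[name]
        ),
    }
```

The report separates three cases because each one raises a different policy question: a missing file, an unlisted file, and a changed file are not resolved by the same person or rule.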
But …
What is the master list and who maintains it?
Who decides what is a legitimate change?
Do you want a dark archive or an active remote data center?
We tested a new Front-End
… and explored an important issue
“Reliability”
Versus
“Accessibility”
Lessons Learned
Always keep expectations aligned
Don’t confuse accessibility and reliability
Duplication of structure is complicated
Communication highlights communication
Trust Elements
Can remote data be accessed?
Can remote data be retrieved and re-used?
Can remote data be verified?
Can ownership be clearly defined?
50,000 ARC files
6 Terabytes of data
Short processing time
Parallel indexing and display system
Looked “default” to the user
SDSC and LC explored a new approach to working with web archives
Using default tools, our initial indexing rate was 1000 files per day…
This was over our time budget …
… more than 6 weeks of constant computing to index the entire collection.
We ran 18 parallel indexing instances – reduced processing to a week
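The arithmetic behind these figures can be sketched as follows. This assumes the collection being indexed is the 50,000 ARC files listed earlier and that "a week" means 7 days; the function name is illustrative.

```python
def days_to_index(total_files: int, files_per_day: float, workers: int = 1) -> float:
    """Days needed if every worker sustains files_per_day with perfect scaling."""
    return total_files / (files_per_day * workers)


serial_days = days_to_index(50_000, 1_000)      # one default indexer
ideal_days = days_to_index(50_000, 1_000, 18)   # 18 instances, perfect scaling
observed_speedup = serial_days / 7              # reported result: about a week

if __name__ == "__main__":
    print(f"serial: {serial_days:.0f} days")
    print(f"ideal with 18 workers: {ideal_days:.1f} days")
    print(f"observed speedup: {observed_speedup:.1f}x")
```

Fifty days of serial work matches "more than 6 weeks". Perfect scaling across 18 instances would take under three days, so the reported week suggests the instances shared a bottleneck or carried extra overhead.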
We modified the Wayback source code to create a new access infrastructure
Lessons Learned
Sometimes you need to start over
Default setup isn’t always easiest
Time is a wonderful motivator
Experts are often interested in your work
Trust Elements
Can a new organization bring new expertise?
Are the final results the same?
Can the results be reached in a better way?
Can a new organization work with your partners?
Next steps ….
Chronopolis!
Chronopolis: A Partnership
Chronopolis is being developed by a national consortium led by SDSC and the UCSD Libraries.
Initial Chronopolis provider sites include:
SDSC and UCSD Libraries at UC San Diego
University of Maryland
National Center for Atmospheric Research (NCAR) in Boulder, CO
UCSD Libraries
Institutions and Roles - UCSD
SDSC
– Storage and networking services
– SRB support
– Transmission Packaging Modules
UCSD Libraries
– Metadata services (PREMIS)
– DIPs (Dissemination Information Packages)
– Other advanced data services as needed
Institutions and Roles - NCAR
National Center for Atmospheric Research
– Archives: Complete copy of all data
– Storage and network support
– Network testing
Institutions and Roles - UMIACS
University of Maryland – Institute for Advanced Computer Studies
– Archives: Complete copy of all data
– Advanced data services
• PAWN: Producer – Archive Workflow Network in Support of Digital Preservation
• ACE: Auditing Control Environment to Ensure the Long Term Integrity of Digital Archives
– Other advanced data services as needed
SDSC Chronopolis Program
Chronopolis Vocabulary
Partners – UCSD Libraries, National Center for Atmospheric Research, University of Maryland Institute for Advanced Computer Studies all provide grid-enabled storage nodes for Chronopolis services.
Clients – ICPSR, CDL – contribute content to the Chronopolis preservation network.
SRB – Storage Resource Broker – datagrid software.
iRODS – integrated Rule Oriented Data System – datagrid software.
ACE – Auditing Control Environment – part of the ADAPT project at UMD.
PAWN – Producer Archive Workflow Network – part of the ADAPT project at UMD.
INCA – user level grid monitoring - executes periodic, automated, user-level testing of Grid software and services – grid middleware.
BagIt – Transfer specification developed by CDL and the Library of Congress.
GridFTP – parallel transfer technology - moves large collections within a grid wide-area network.
Chronopolis: Inside
Linked by main staging grid where data is verified for integrity, and quarantined for security purposes.
Collections are independently pulled into each system.
Manifest layer provides added security for database management and data integrity validation.
Benefits
– 3 independently managed copies of the collection
– High availability
– High reliability
[Diagram: Chron clients (CDL, ICPSR) push to the SDSC Staging Grid; the SDSC Core Center Archive, NCAR, and UMD each pull a copy (Copy 1, Copy 2, Copy 3). Component labels: Manifest Management, MCAT DB, Multiple Hash Verifications, MCAT, Grid Brick Disks, HPSS Tape.]
SDSC Leveraged Infrastructure Serves Both HPC & Digital Preservation
• Archive: 25 PB capacity, both HPSS & SAM-QFS
• Online disk: ~3 PB total
• HPC parallel file systems
• Collections
• Databases
• Access Tools
Adapted from Richard Moore (SDSC)
Chronopolis Demonstration Project 2006-2007
– Demonstration Collections Ingested within Chronopolis
• National Virtual Observatory (NVO)
– 3 TB Hyperatlas Images (partial collection)
• Library of Congress PG Image Collection
– 600 GB Prokudin-Gorskii Image Collection
• Interuniversity Consortium for Political and Social Research (ICPSR)
– 2 TB Web Accessible Data
• NCAR Observational Data
– 3 TB Observational Re-Analysis Data
NDIIPP Chronopolis Project
• Creating a 3-node federated data grid at SDSC, NCAR and UMD – up to 50 TB data from CDL and ICPSR
• Installing and testing a suite of monitoring tools using ACE, PAWN, INCA
• Creating Appropriate Transmission Information Packages
• Generating PREMIS definitions for data
• Writing Best Practices documents for clients and partners
Chronopolis Grid Framework
[Diagram labels: Sun 6140 62 TB; SRB D-Broker; SRB MCAT; Sun SAM-QFS; Apple Xsan; CDL Server; ICPSR Server; SDSC, NCAR, UMD/Maryland, ICPSR, and UC Berkeley networks; Chronopolis Data 12-25 TB and 12 TB; Tape Silos.]
Adapted from Bryan Banister (SDSC)
NDIIPP Chronopolis Clients - CDL
California Digital Library
– A part of UCOP, supports the University of California libraries
– Providing up to 25 TB of data: Web-At-Risk project
• Five years of political and governmental websites
• ARC files created from web crawls
• Using BagIt Transfer Structure
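As a rough illustration of the BagIt structure mentioned above: a bag keeps its payload under a data/ directory and lists each payload file in a manifest, one line per file, as a checksum followed by the path relative to the bag. The sketch below builds those manifest lines. It assumes MD5 as the algorithm, the helper name `manifest_lines` is hypothetical, and it is not the CDL transfer tooling.

```python
import hashlib
from pathlib import Path


def manifest_lines(bag_dir: Path, algorithm: str = "md5") -> list[str]:
    """Return 'checksum  path' lines for every file under bag_dir/data."""
    payload = bag_dir / "data"
    lines = []
    for path in sorted(payload.rglob("*")):
        if path.is_file():
            # Reads each file whole; chunked reads would suit large ARC files.
            digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    return lines
```

The receiving side can recompute the same lines after transfer and compare them with the manifest that travelled with the bag.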
Diagram of CDL Data Transfer
[Diagram labels: CDL Virtual Machine at UCB; Wget Bagit; Wget files 1-10, 11-20; File 1 … File n; Bagit Manifest; Possible SRB/Bagit Module; Parallel Wget Xfer; SDSC Network; Chron Staging; Chron Repository; UMIACS; UMIACS Network; NCAR; NCAR Network.]
Adapted from Bryan Banister (SDSC)
NDIIPP Chronopolis Clients - ICPSR
Inter-University Consortium for Political and Social Research, University of Michigan
– Providing approximately 12 TB of data: Wide variety of types
– Already working with SDSC using SRB
Diagram of ICPSR Transfer
[Diagram labels: ICPSR SRB Repository, UMich; Sput/Srsync Files; Sput tar files; File 1 … File n; EMC SAN; Chron SRB MCAT; Parallel Sput/Srsync Xfer; SDSC Network; Chron Staging; Chron Repository; UMIACS; UMIACS Network; NCAR; NCAR Network.]
Adapted from Bryan Banister (SDSC)
Ongoing and Future Initiatives
• Migration of Chronopolis from SRB to iRODS
• Develop Interoperability with Community Based Archival Systems/Standards
• TRAC compliance for SDSC Production Preservation Services/Chronopolis Consortium
Looking for Partnerships
• Repositories interested in moving large digital collections among heterogeneous repository systems.
• Fedora, DSpace or E-Prints sites interested in managed datagrid storage.
• Institutions interested in personnel swaps to conduct TRAC audit assessment compliance.
• Community Needs for Mass-Scale Data Transmission and Storage.
Chronopolis Credits
SDSC – Fran Berman, Richard Moore, David Minor, Chris Jordan, Jim D’Aoust, Robert McDonald, Don Sutton, Bryan Banister, Phong Dinh, Jay Dombrowski, Emilio Valente
UCSD Libraries – Brian Schottlaender, Luc Declerck, Ardys Kozbial, Brad Westbrook, Arwen Hutt
NCAR – Don Middleton, Michael Burek, Linda McGinley
UMIACS – Joseph JaJa, Mike Smorul, Mike McGann
Library of Congress – Martha Anderson, Lisa Hoppis
CACI – Mike Ivey
Chronopolis is ...
• a geographically distributed preservation environment that supports long-term management and stewardship of digital collections
• implemented by developing and deploying a distributed data grid, and by supporting its human, policy, and technological infrastructure.
• technology forecasting and migration in support of long-term life-cycle management of the dedicated preservation environment.
Chronopolis focuses on ...
• Assessment of the needs of potential user communities and development of appropriate service models
• Development of Memoranda of Understanding (MOUs), Service Level Agreements (SLAs), etc. to formalize trust relationships and manage expectations
• Assessment and prototyping of best practices for bit preservation, authentication, metadata, etc.
• Development of cost and risk models for long-term preservation
• Development of appropriate success metrics to evaluate usefulness, reliability, and usability of infrastructure
UCSD Libraries
The people of Chronopolis are ...
In conclusion …
Organizations need ways to validate trust in 3rd parties
… and to demonstrate trust.
SDSC and the Library of Congress explored one way to do this …
by working with Cyberinfrastructure
With a trusted relationship, many journeys become possible