Data Migration and Policy Management - MSST...

Data Migration and Policy Management Reagan W. Moore Wayne Schroeder Arcot Rajasekar Mike Wan San Diego Supercomputer Center {moore,schroede,sekar,mwan}@sdsc.edu http://irods.sdsc.edu http://www.sdsc.edu/srb/


Page 1: Data Migration and Policy Management - MSST Conferencestorageconference.us/2007/archival-workshop/SDSC... · backup functionality sufficient for the repositoryÕs services and for

Data Migration and Policy Management

Reagan W. Moore, Wayne Schroeder, Arcot Rajasekar, Mike Wan
San Diego Supercomputer Center

{moore,schroede,sekar,mwan}@sdsc.edu
http://irods.sdsc.edu

http://www.sdsc.edu/srb/


Preservation Environments

• NARA Transcontinental Persistent Archive Prototype

• NSF National Science Digital Library persistent archive

• NHPRC Persistent Archive Testbed

• California Digital Library - Digital Preservation Repository

• UCSD Libraries image archive


Challenge

• Migrate records to a new preservation environment through migration of media
  • Is this feasible?
• What are the implications on trustworthiness?
  • Authenticity
  • Integrity
  • Chain of custody


OAIS Reference Model

• Representation information is the context required to interpret a file
  • Format
  • Parsing application
  • Descriptive metadata
  • Knowledge community
• Projects developing representation information for records include:
  • Open Archival Information System reference model
  • CASPAR - Cultural, Artistic and Scientific knowledge for Preservation, Access and Retrieval
  • Planets - Preservation and Long-term Access through Networked Services


Migration Concept

• Representation information for data management systems
  • Context that describes the management policies and management procedures that enforce trustworthiness properties on the media
• The context describing management of media is as important as descriptions of the internal structure of the content on the media


Rule-Oriented Data System

• Representation information for the data management system
  • Rules used to enforce management policies
  • Micro-services that are composed to create management processes
• Rules are applied at the remote storage system to control the execution of the micro-services
  • Can embed in the system the constraints that enforce management policies
  • Can query persistent state information to verify compliance
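The relationship between rules, micro-services, and persistent state can be illustrated with a minimal Python sketch. This is not iRODS code: the names `Rule`, `RuleEngine`, `compute_checksum`, and `mark_replicated` are hypothetical, and the in-memory `audit` list stands in for the persistent state that would be queried to verify compliance.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A micro-service is modeled as a small function acting on a shared state dict.
MicroService = Callable[[Dict[str, object]], None]

@dataclass
class Rule:
    event: str                                       # event that triggers the rule
    condition: Callable[[Dict[str, object]], bool]   # policy constraint
    actions: List[MicroService]                      # micro-services, run in order

@dataclass
class RuleEngine:
    rules: List[Rule] = field(default_factory=list)
    audit: List[str] = field(default_factory=list)   # stand-in for persistent state

    def fire(self, event: str, state: Dict[str, object]) -> None:
        for rule in self.rules:
            if rule.event == event and rule.condition(state):
                for action in rule.actions:
                    action(state)
                    self.audit.append(f"{event}:{action.__name__}")

def compute_checksum(state):
    state["checksum"] = hashlib.sha256(state["data"]).hexdigest()

def mark_replicated(state):
    state["replicas"] = state.get("replicas", 1) + 1

engine = RuleEngine()
engine.rules.append(
    Rule("ingest", lambda s: len(s["data"]) > 0, [compute_checksum, mark_replicated])
)
record = {"data": b"example record"}
engine.fire("ingest", record)
```

After the call, `engine.audit` lists which micro-services ran for the ingest event, which is the kind of state a compliance query would inspect.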


Rule-Oriented Data System

[Architecture diagram. Labeled components: Client Interface; Admin Interface; Rule Invoker; Rule Base; Current State; Service Manager; Micro Service Modules (Metadata-based Services); Micro Service Modules (Resource-based Services); Resources; Metadata Persistent Repository; Confs; Rule Modifier Module, Config Modifier Module, and Metadata Modifier Module, each paired with a Consistency Check Module.]


Representation Information: Policies

• Examples of management policies and information specific to media
  • Maximum number of times that the media should be read versus number of times the media has been read
  • Frequency with which the media should be checked for data corruption
  • Management of encodings (error correction, compression)
  • Management of known bad sectors
  • Management of vendor-defined containers for aggregating data and metadata within the media
  • Management of media labels (internal, external)
  • Management of large files - links between media
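Two of the media policies above (a read-count limit and a corruption-check frequency) can be expressed as a small check against recorded media state. The sketch below is illustrative only; `MediaState`, its field names, and the threshold values are assumptions, not part of any vendor or iRODS interface.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class MediaState:
    label: str                 # external media label (hypothetical)
    reads: int                 # number of times the media has been read
    max_reads: int             # policy: maximum number of reads
    last_checked: date         # date of the last corruption check
    check_interval_days: int   # policy: how often to check for corruption

def policy_violations(media: MediaState, today: date) -> list:
    """Return the list of policies that this media currently violates."""
    problems = []
    if media.reads >= media.max_reads:
        problems.append("read limit reached")
    if today - media.last_checked > timedelta(days=media.check_interval_days):
        problems.append("integrity check overdue")
    return problems

tape = MediaState("TAPE-0001", reads=950, max_reads=1000,
                  last_checked=date(2007, 1, 1), check_interval_days=180)
report = policy_violations(tape, date(2007, 9, 4))
```

Keeping both the policy value and the observed value in the same record is what allows the policy to travel with the media when it is migrated.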


Representation Information: Processes

• Examples of management processes that are applied on the media
  • Updates to media state information
  • Trustworthiness assessment evaluation
  • Error correction and compression algorithms
  • Actions on bad sectors within the media
  • Manipulation of containers


Why Rule-based Systems?

• Cannot rely upon any component of a distributed storage environment
  • Media corruption
  • Systemic vendor software problems
  • Operational error
  • Natural disaster
  • Malicious users

• Use rules to prove that the media contains the desired data, that the data is authentic, that a chain of custody for the data has been maintained, and that associated provenance information is correctly linked.
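One concrete way to show that media still contains the desired data is to compare a freshly computed checksum against the value recorded at ingest. The following sketch assumes a hypothetical in-memory `catalog` mapping file names to SHA-256 digests; the slides do not specify which checksum algorithm or catalog layout is used.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify(name: str, data: bytes, catalog: dict) -> str:
    """Compare a fresh checksum against the one recorded at ingest."""
    expected = catalog.get(name)
    if expected is None:
        return "unregistered"          # no custody record exists for this file
    return "ok" if sha256_hex(data) == expected else "corrupt"

# Checksum recorded when the file was ingested (hypothetical catalog entry).
catalog = {"report.txt": sha256_hex(b"original content")}

status_good = verify("report.txt", b"original content", catalog)
status_bad = verify("report.txt", b"original c0ntent", catalog)
status_unknown = verify("other.txt", b"", catalog)
```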


Emerging Standard for Trustworthiness

• RLG/NARA TRAC - Trustworthy Repositories Audit & Certification: Criteria and Checklist.
  http://wiki.digitalrepositoryauditandcertification.org/pub/Main/ReferenceInputDocuments/trac.pdf
• Defines three categories of criteria
  • A - Organizational Infrastructure
  • B - Digital Object Management
  • C - Technologies, Technical Infrastructure, & Security


TRAC Assessment Criteria

B1.1 Repository identifies properties it will preserve for digital objects.
B1.2 Repository clearly specifies the information that needs to be associated with digital material at the time of its deposit (i.e., SIP).
B1.3 Repository has mechanisms to authenticate the source of all materials.
B1.4 Repository's ingest process verifies each submitted object (i.e., SIP) for completeness and correctness as specified in B1.2.
B1.5 Repository obtains sufficient physical control over the digital objects to preserve them.
B1.6 Repository provides producer/depositor with appropriate responses at predefined points during the ingest processes.
B1.7 Repository can demonstrate when preservation responsibility is formally accepted for the contents of the submitted data objects (i.e., SIPs).
B1.8 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Ingest: content acquisition).


TRAC Assessment Criteria

• For each file within the media, there is required preservation state information.

• Information is also required about the media itself:
  • B3.4 Repository can provide evidence of the effectiveness of its preservation planning.
  • This can be interpreted as the status of error recovery operations and the generation of periodic reports summarizing media integrity


TRAC Assessment Criteria

B4.1 Repository employs documented preservation strategies.
B4.2 Repository implements/responds to strategies for archival object (i.e., AIP) storage and migration.
B4.3 Repository preserves the Content Information of archival objects (i.e., AIPs).
B4.4 Repository actively monitors integrity of archival objects (i.e., AIPs).
B4.5 Repository has contemporaneous records of actions and administration processes that are relevant to preservation (Archival Storage).


TRAC Assessment Criteria

C1.1 Repository functions on well-supported operating systems and other core infrastructural software.
C1.2 Repository ensures that it has adequate hardware and software support for backup functionality sufficient for the repository's services and for the data held, e.g., metadata associated with access controls, repository main content.
C1.3 Repository manages the number and location of copies of all digital objects.
C1.4 Repository has mechanisms in place to ensure any/multiple copies of digital objects are synchronized.
C1.5 Repository has effective mechanisms to detect bit corruption or loss.
C1.6 Repository reports to its administration all incidents of data corruption or loss, and steps taken to repair/replace corrupt or lost data.
C1.7 Repository has defined processes for storage media and/or hardware change (e.g., refreshing, migration).
C1.8 Repository has a documented change management process that identifies changes to critical processes that potentially affect the repository's ability to comply with its mandatory responsibilities.
C1.9 Repository has a process for testing the effect of critical changes to the system.
C1.10 Repository has a process to react to the availability of new software security updates based on a risk-benefit assessment.


iRODS

• integrated Rule-Oriented Data System
  • http://irods.sdsc.edu
• Data grid
  • Enables the creation of a shared collection from data distributed across multiple storage systems, administrative domains, continents

• Automates the application of management policies


Preservation Concepts

• Infrastructure independence
  • Manages properties of the files independently of the storage system
• Data virtualization
  • Manages names assigned to files, storage, users
  • Implements standard operations performed upon the storage system
• Trust virtualization
  • Manages authentication and authorization
• Management virtualization
  • Implements management policies and management processes
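Data virtualization can be pictured as a catalog that maps a stable logical name to whatever physical locations currently hold the file. The sketch below is a simplified illustration under that assumption; `LogicalNamespace`, the resource names, and the paths are hypothetical and do not reflect the actual catalog schema.

```python
class LogicalNamespace:
    """Maps logical file names to physical replicas, keyed by storage resource."""

    def __init__(self):
        self._replicas = {}   # logical name -> {resource name: physical path}

    def register(self, logical, resource, physical):
        self._replicas.setdefault(logical, {})[resource] = physical

    def unregister(self, logical, resource):
        del self._replicas[logical][resource]

    def resolve(self, logical):
        # Return a copy so callers cannot modify the catalog by accident.
        return dict(self._replicas[logical])

ns = LogicalNamespace()
ns.register("/zone/coll/scan001.tif", "old_tape", "/hpss/a/0001")
# Migration: add the replica on new storage, then retire the old one.
ns.register("/zone/coll/scan001.tif", "new_disk", "/vault/7f/scan001.tif")
ns.unregister("/zone/coll/scan001.tif", "old_tape")
locations = ns.resolve("/zone/coll/scan001.tif")
```

The logical name seen by users never changes during the migration, which is the sense in which file properties are managed independently of the storage system.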


Mapping TRAC Criteria to iRODS Rules

• Apply atomic, deferred, and periodic rules

• Generated 105 micro-services needed to implement TRAC criteria

• Generate monthly report on risk (list all incidents, date, type of incident, number of files lost, recovery procedure)

• Analyze audit trails to verify identity of all persons accessing the data, and compare their roles with desired access controls
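The audit-trail analysis described in the last bullet amounts to checking each recorded action against the role assigned to the acting user. A minimal sketch follows; the record layout `(user, action, path)`, the role names, and the sample entries are invented for illustration and are not the iRODS audit format.

```python
# Desired access controls: which actions each role may perform (hypothetical).
access_policy = {"archivist": {"read", "write"}, "researcher": {"read"}}
user_roles = {"alice": "archivist", "bob": "researcher"}

# Audit trail entries as (user, action, logical path) tuples (made-up data).
audit_trail = [
    ("alice", "write", "/zone/coll/a.dat"),
    ("bob", "read", "/zone/coll/a.dat"),
    ("bob", "write", "/zone/coll/a.dat"),
    ("mallory", "read", "/zone/coll/b.dat"),
]

def find_violations(trail):
    """Return entries whose action is not permitted for the user's role."""
    violations = []
    for user, action, path in trail:
        role = user_roles.get(user)               # None for unknown identities
        allowed = access_policy.get(role, set())  # unknown role: nothing allowed
        if action not in allowed:
            violations.append((user, action, path))
    return violations

violations = find_violations(audit_trail)
```

Unknown identities are treated as violations here, since the criterion is to verify the identity of every person accessing the data.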


Classes of Assessment Criteria

• Collection properties
  • List properties of associated name spaces
  • Verify properties
  • Compare properties with assertions
• Collection operations
  • Transform file formats
  • Migrate data
  • Generate audit trails
• Structured information
  • Parse audit trails to generate compliance reports
  • Apply templates to extract information
  • Apply templates to format state information


Mounted Collection Interface

• Generic infrastructure for manipulating structured information
  • Storage System
    • Remote directory
  • Structured file
    • Archival Information Package
    • Tar file / XFDU / HDF5 / METS package / XAM
  • Database
    • Tabular information / BLOBS
• Bulk operations
  • Registration


Structured Information Access

• MountCollCreate - create a new MountColl.
• MountCollOpen - open a MountColl file for query.
• MountCollRead - query for subfiles and directory structure.
• MountCollClose - close an opened MountColl.
• MCollSubCreate - create a sub-file in a MountColl.
• MCollSubOpen - open an existing sub-file in a MountColl.
• MCollSubRead - read the content of an opened sub-file.
• MCollSubWrite - write to an opened sub-file.
• MCollSubClose - close an opened sub-file.
• MCollSubunlink - delete an existing sub-file.
• MCollSubStat - get the status of an existing sub-file.
• MCollSubFstat - get the status of an opened sub-file.
• MCollSubLseek - lseek into an opened sub-file.
• MCollSubRename - rename an existing sub-file.
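The idea of listing and reading sub-files inside a structured file can be approximated with Python's standard `tarfile` module. This is only an analogy for the operations listed above, not the MountColl interface itself; the helper names `build_tar`, `list_subfiles`, and `read_subfile` are hypothetical.

```python
import io
import tarfile

def build_tar(files: dict) -> bytes:
    """Create an in-memory tar archive from {sub-file name: content}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()

def list_subfiles(tar_bytes: bytes) -> list:
    """Roughly analogous to querying a mounted collection for its sub-files."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return tar.getnames()

def read_subfile(tar_bytes: bytes, name: str) -> bytes:
    """Roughly analogous to opening, reading, and closing one sub-file."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return tar.extractfile(name).read()

package = build_tar({"meta/manifest.txt": b"a.dat 5\n", "data/a.dat": b"hello"})
names = list_subfiles(package)
payload = read_subfile(package, "data/a.dat")
```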


Preservation Policies

• Migrate files to new media
  • Minimize cost
  • Recover floor space
  • Improve access rate
  • Forced by media lifetime
• Integrity checks
  • Detect corruptions (missing tape, broken tape, bit errors)
  • Retension tape
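A migration policy and an integrity policy are naturally combined: copy the file to the new media, then confirm that the copy's checksum matches the original before the old copy is retired. The sketch below uses temporary directories as stand-ins for old and new media; the function name `migrate` and the choice of SHA-256 are assumptions.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def migrate(src: Path, dst_dir: Path):
    """Copy src into dst_dir and verify the copy before reporting success."""
    before = checksum(src)
    dst = dst_dir / src.name
    shutil.copy2(src, dst)
    if checksum(dst) != before:
        dst.unlink()                      # discard the bad copy, keep the original
        raise IOError(f"checksum mismatch while migrating {src.name}")
    return dst, before

with tempfile.TemporaryDirectory() as tmp:
    old_media = Path(tmp) / "old_media"
    new_media = Path(tmp) / "new_media"
    old_media.mkdir()
    new_media.mkdir()
    source = old_media / "record.dat"
    source.write_bytes(b"preserved bytes")
    copied, digest = migrate(source, new_media)
    migrated_ok = copied.read_bytes() == b"preserved bytes"
```

The returned digest would typically be stored as state information so that later integrity checks on the new media have a reference value.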


National Archives and Records Administration Transcontinental Persistent Archive Prototype

Federation of Seven Independent Data Grids

[Diagram: seven federated sites, each with its own MCAT catalog: U Md, SDSC, Georgia Tech, NARA II, NARA I, Rocket Center, U NC.]

Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.


For More Information

Reagan W. Moore
San Diego Supercomputer Center

moore@sdsc.edu

http://www.sdsc.edu/srb/
http://irods.sdsc.edu/


Data Management Applications

• Data grids
  • Share data - organize distributed data as a collection
• Digital libraries
  • Publish data - support browsing and discovery
• Persistent archives
  • Preserve data - manage technology evolution
• Real-time sensor systems
  • Federate sensor data - integrate across sensor streams
• Workflow systems
  • Analyze data - integrate client- & server-side workflows


Your Backgrounds - Class Exercise

• Users of data management technology

• Managers of data systems

• Developers of data systems


Data Management Systems: Which do you use? - Class Exercise

• Storage Resource Broker (SRB)

• Globus Tool Kit

• Internet Backplane Protocol / Lstore

• Lots of Copies Keep Stuff Safe (LOCKSS)

• DSpace / SRB

• Parrot

• Integrated Rule-Oriented Data System (iRODS)

• .net SECPAL


What are your specific data management requirements?

• Access mechanism
  • GridFTP (3), load libraries (Python), MPI-IO (SEMPLAR)
• Organization
  • File-based, collection-based, block-based, semantically, commonality
• Description
  • XML, METS, OAIS, OAI-PMH, community-specific
• Management policies
  • Security (time-dependent), retention/disposition, integrity
• Collection properties
  • Authenticity, Authoritative sources


What are your specific data management requirements?

• Sharing
  • Collaboration environment
  • Computational model sharing
• Standards
  • Community specific format, semantic terms, services
• Analysis
  • Reasoning on system for compliance


Generic Infrastructure

• Data grids manage data distributed across multiple types of storage systems
  • File systems, tape archives, object ring buffers
• Data grids manage collection attributes
  • Provenance, descriptive, system metadata
• Data grids manage technology evolution
  • At the point in time when new technology is available, both the old and new systems can be integrated


Extremely Successful

• Storage Resource Broker (SRB) manages 2 PBs of data in internationally shared collections
• Data collections for NSF, NARA, NASA, DOE, DOD, NIH, LC, NHPRC, IMLS
  • Astronomy: Data grid
  • Bio-informatics: Digital library
  • Earth Sciences: Data grid
  • Ecology: Collection
  • Education: Persistent archive
  • Engineering: Digital library
  • Environmental science: Data grid
  • High energy physics: Data grid
  • Humanities: Data Grid
  • Medical community: Digital library
  • Oceanography: Real time sensor data, persistent archive
  • Seismology: Digital library, real-time sensor data
• Goal has been generic infrastructure for distributed data


Columns for each date: GBs = GBs of data stored; Files = 1000's of files; Users = users with ACLs. A "-" marks a project with no entry for that date.

                    |      5/17/02     |          6/30/04          |          9/4/07
Project                   GBs    Files      GBs    Files    Users      GBs    Files    Users

Data Grid
NSF / NVO              17,800    5,139   51,380    8,690       80   88,216   14,550      100
NSF / NPACI             1,972    1,083   17,578    4,694      380   38,147    7,715      380
Hayden                  6,800       41    7,201      113      178    8,013      161      227
Pzone                     438       31      812       47       49   27,914   16,106       68
NSF / LDAS-SALK           239        1    4,562       16       66  202,312      166       67
NSF / SLAC-JCSG           514       77    4,317      563       47   21,644    2,330       55
NSF / TeraGrid              -        -   80,354      685    2,962  280,247    7,235    3,267
NIH / BIRN                  -        -    5,416    3,366      148   21,000   35,301      445
NCAR                        -        -        -        -        -   36,689      268        2
LCA                         -        -        -        -        -    3,445       74        2

Digital Library
NSF / LTER                158        3      233        6       35      260       42       36
NSF / Portal               33        5    1,745       48      384    2,620       53      460
NIH / AfCS                 27        4      462       49       21      733       94       21
NSF / SIO Explorer         19        1    1,734      601       27    2,750    1,202       27
NSF / SCEC                  -        -   15,246    1,737       52  168,931    3,545       73
LLNL                        -        -        -        -        -   16,931    1,895        5
CHRON                       -        -        -        -        -   12,634    6,299        5

Persistent Archive
NARA                        7        2       63       81       58    4,989    6,390       58
NSF / NSDL                  -        -    2,785   20,054      119    7,188   77,479      136
UCSD Libraries              -        -      127      202       29    5,158    1,319       29
NHPRC / PAT                 -        -        -        -        -    2,576      966       28
RoadNet                     -        -        -        -        -    3,174    1,321       30
UCTV                        -        -        -        -        -    7,140        2        5
LOC                         -        -        -        -        -    6,644      192        8
Earth Sci                   -        -        -        -        -    5,869      647        5

TOTAL                   28 TB    6 mil   194 TB   40 mil    4,635   975 TB  185 mil    5,539


iRODS vs Storage Resource Manager

• SRB - manage data
  • Quotas
    • Data redirection to alternative
  • Logical name
    • Transfer on logical name
  • Synchronous/Asynchronous
    • Deferred operations
  • Clients
    • GridFTP, SRB, Web, Library
  • Compound resource
    • Stage
    • Cache (automated)

• SRM - manage storage
  • Reservation
    • Fail if not enough space
  • SURL
    • Transfer on TURL
  • Asynchronous
    • Poll for status
  • Clients
    • GridFTP
  • Compound resource
    • Stage


Synergism Between Applications

• Synergism
  • Distributed data
    • Sources, users, performance, reliability, analysis
  • Technology management
    • Incorporate new technology
• Unique components
  • Information management
    • Semantics, formats, services
  • Management policies
    • Integrity, authenticity, availability, authorization


Observations of Production Data Grids

• Each community implements different management policies

• Need a mechanism to support the socialization of shared collections
  • Community specific preservation objectives
  • Community specific assertions about properties of the shared collection
  • Community specific management policies


Data Grid Mechanisms

• Implement essential components needed for synergism
  • Infrastructure independence
  • Data and trust virtualization
• Implement components needed for specific management policies and processes
  • Map processes to standard micro-services
  • Management virtualization


Data Management

[Layered diagram, top to bottom:]
• Data Management Environment: Conserved Properties, Control Mechanisms, Remote Operations
• Management Functions: Assessment Criteria, Management Policies, Capabilities
  (Data grid - Management virtualization)
• Data Management Infrastructure: Persistent State, Rules, Micro-services
  (Data grid - Data and trust virtualization)
• Physical Infrastructure: Database, Rule Engine, Storage System

iRODS - integrated Rule-Oriented Data System


iRODS Class Exercise

• http://irods.sdsc.edu
• Downloads
  • BSD license
  • Registration / agreement
• Tar file
• Installation script (Linux, Solaris, Mac OSX)
  • Automated download of PostgreSQL, ODBC
  • Installation of PostgreSQL, ODBC, iRODS
  • Initiation of iRODS collection


iRODS Installation Class Exercise

• Unpack the release tar file
  • gzip -d irods.tar
  • tar xf irods.tar
• cd into the top directory
• run 'install.pl' (no changes are needed).
  • It will prompt for a few parameters or you can edit install.config.


Install.pl Options

• $ ./install.pl help
• This script performs the steps required to do a full postgresQL ICAT-enabled IRODS installation, including downloading postgresql and odbc, configuring and building Postgresql, Postgresql-ODBC, and IRODS, initializing the database, ingesting the ICAT tables, configuring a user environment, and bringing up the system.
• Command line options are:
  • (blank) - Do an installation, continuing where it left off previously. If the installation is done, it will just print some messages.
  • help (or -h) - This summary.
  • ps - Show the postgresql and irods server processes
  • stop - Stop the postgresql and irods server processes
  • start - Start the postgresql and irods server processes
  • vacuum - Stop the rods servers and do a postgresql vacuum (indexing is done automatically).
  • clean - Stop processes and remove what install.pl has built. Handy when testing.
  • drop - drop the DB and reuse the postgresql installation when building again. The script will automatically ask about doing this when installing if the postgresql installation exists. Handy for testing.


Install.pl States - Class Exercise

• Track the completion status of each step:

• # If you want to redo the installation steps, you can remove the stateFile.

• $stateFile="install.state"

• # Major state/steps:

• # A - build and install postgres

• # B - build and install odbc

• # C - build irods

• # D - configure and run postgres server, create db

• # E - create the ICAT database

• # F - finish ICAT and set up user environment
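The resumable behavior described above (record each completed step, continue where the previous run left off, remove the state file to start over) can be sketched in Python. The on-disk format of install.state is not shown in the slides, so the space-separated step letters used here are an assumption, as are the function names and the `fail_at` parameter used to simulate an interrupted run.

```python
import tempfile
from pathlib import Path

# Major steps, with descriptions taken from the comments above.
STEPS = {
    "A": "build and install postgres",
    "B": "build and install odbc",
    "C": "build irods",
    "D": "configure and run postgres server, create db",
    "E": "create the ICAT database",
    "F": "finish ICAT and set up user environment",
}

def load_state(path: Path) -> list:
    """Read completed step letters; an absent file means a fresh install."""
    return path.read_text().split() if path.exists() else []

def run(path: Path, fail_at=None) -> list:
    """Run the remaining steps in order, recording each one as it completes."""
    done = load_state(path)
    for step in STEPS:
        if step in done:
            continue                      # completed in an earlier run
        if step == fail_at:
            return done                   # simulate an interrupted installation
        done.append(step)                 # a real script would do the work here
        path.write_text(" ".join(done))
    return done

with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "install.state"
    first = list(run(state_file, fail_at="C"))
    second = run(state_file)
```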


iRODS Source Distribution

• api, clientLib, clients, config.guess, configure, COPYRIGHT, CVS, doc, install, install.config, install.config.orig

• install.pl, installLogs, installOra.pl, installServer.pl, LICENSE.txt, Makefile, mk, nt, README, server, Vault


Directory ~rods/server - Class Exercise

• $ ls -l server

• total 0

• drwxr-xr-x 6 reaganmo admin 204 Jul 23 08:54 CVS

• drwxr-xr-x 11 reaganmo admin 374 Jul 23 17:05 bin

• drwxr-xr-x 13 reaganmo admin 442 Jul 23 17:05 config

• drwxr-xr-x 47 reaganmo admin 1598 Jul 23 08:54 include

• drwxr-xr-x 15 reaganmo admin 510 Sep 19 13:52 log

• drwxr-xr-x 121 reaganmo admin 4114 Jul 23 17:05 obj

• drwxr-xr-x 4 reaganmo admin 136 Jul 23 08:54 schema

• drwxr-xr-x 9 reaganmo admin 306 Jul 23 08:54 src

• drwxr-xr-x 17 reaganmo admin 578 Jul 23 17:05 test


Directory ~rods/server/bin - Class Exercise

• $ ls -l server/bin

• total 23024

• drwxr-xr-x 6 reaganmo admin 204 Jul 23 08:54 CVS

• drwxr-xr-x 4 reaganmo admin 136 Jul 23 08:54 cmd

• -rwxr-xr-x 1 reaganmo admin 3917976 Jul 23 17:05 irodsAgent

• -rwxr-xr-x 1 reaganmo admin 3916864 Jul 23 17:05 irodsReServer

• -rwxr-xr-x 1 reaganmo admin 3924136 Jul 23 17:05 irodsServer

• -rwxr-xr-x 1 reaganmo admin 1655 Dec 13 2006 list.pl

• -rwxr-xr-x 1 reaganmo admin 5236 Apr 11 17:12 start.pl

• -rwxr-xr-x 1 reaganmo admin 3294 Mar 23 10:49 stop.pl

• -rwxr-xr-x 1 reaganmo admin 3400 Dec 13 2006 vacuumdb.pl


Directory ~rods/server/config - Class Exercise

• $ ls -l server/config

• total 40

• drwxr-xr-x 6 reaganmo admin 204 Jul 23 08:54 CVS

• -rw-r--r-- 1 reaganmo admin 782 May 24 16:01 HostAccessControl

• -rw------- 1 reaganmo admin 665 Jul 23 17:05 irodsHost

• -rw-r--r-- 1 reaganmo admin 665 Jul 9 15:20 irodsHost.in

• -rw-r--r-- 1 reaganmo admin 0 Jul 23 17:05 irodsHost.sav

• drwxr-xr-x 3 reaganmo admin 102 Jul 23 08:54 log

• drwxr-xr-x 3 reaganmo admin 102 Sep 20 14:34 packedRei

• drwxr-xr-x 19 reaganmo admin 646 Aug 7 11:18 reConfigs

• -rw------- 1 reaganmo admin 956 Jul 23 17:05 server.config

• -rw-r--r-- 1 reaganmo admin 970 Mar 5 2007 server.config.in

• -rw-r--r-- 1 reaganmo admin 0 Jul 23 17:05 server.config.sav


Directory ~rods/server/config/reconfigs - Class Exercise

• $ ls -l server/config/reconfigs

• total 304

• drwxr-xr-x 5 reaganmo admin 170 Jul 23 08:54 CVS

• -rw-r--r-- 1 reaganmo admin 4102 Jun 13 16:05 core.dvm

• -rw-r--r-- 1 reaganmo admin 737 Jun 13 16:05 core.fnm

• -rw------- 1 reaganmo admin 28404 Aug 7 11:17 core.irb

• -rwxr-xr-x 1 reaganmo admin 14353 Jul 24 15:07 core.irb.1

• -rw------- 1 reaganmo admin 28404 Aug 7 11:18 core.irb.2

• -rw-r--r-- 1 reaganmo admin 101 Jul 10 16:39 core.irb.3

• -rw-r--r-- 1 reaganmo admin 14353 Jul 24 15:07 core.irb.4

• -rwxr-xr-x 1 reaganmo admin 14052 Jul 12 09:34 core.irb.orig

• -rw-r--r-- 1 reaganmo admin 690 Sep 22 2006 core2.irb

• -rw-r--r-- 1 reaganmo admin 714 Oct 3 2006 core3.irb

• -rw-r--r-- 1 reaganmo admin 269 Oct 3 2006 misc.irb

• -rw-r--r-- 1 reaganmo admin 1372 Oct 3 2006 reRules


iRODS Clients

• Currently six clients
  • iRODS rich web client
    • https://rt.sdsc.edu:8443/irods_web/yui_ext/index.php
  • Unix shell commands
    • RODS/clients/icommands/bin
  • FUSE user level file system
    • RODS/clients/fuse/bin
  • Jargon Java I/O class library
  • PHP web browser
  • C library calls


Directory ~rods/clients/icommands/bin - Class Exercise

• $ ls -l clients/icommands/bin

• total 23984

• drwxr-xr-x 5 reaganmo admin 170 Jul 23 08:54 CVS

• -rwxrwxrwx 1 reaganmo admin 3979 Jul 26 06:21 HELP.looptest

• -rw-r--r-- 1 reaganmo admin 57 Jul 10 16:45 chgCoreToCore1.ir

• -rw-r--r-- 1 reaganmo admin 52 Jun 20 12:24 chgCoreToOrig.ir

• -rwxrwxrwx 1 reaganmo admin 363 Jul 6 08:27 chksumColl.ir

• -rwxrwxrwx 1 reaganmo admin 422 Jul 6 10:42 copyColl.ir

• -rw-r--r-- 1 reaganmo admin 334 Jul 23 16:20 exportMetadata.ir

• -rw-r----- 1 reaganmo admin 4585 Aug 15 09:05 foo1

• -rwxrwxrwx 1 reaganmo admin 365 Jul 6 10:42 forcechksumColl.ir

• -rwxr-xr-x 1 reaganmo admin 388532 Jul 23 17:05 iadmin

• -rwxr-xr-x 1 reaganmo admin 391596 Jul 23 17:05 icd

• -rwxr-xr-x 1 reaganmo admin 413248 Jul 23 17:05 ichksum

• -rwxr-xr-x 1 reaganmo admin 397860 Jul 23 17:05 ichmod

• -rwxr-xr-x 1 reaganmo admin 419320 Jul 23 17:05 icp

• -rwxr-xr-x 1 reaganmo admin 353008 Jul 23 17:05 iexecmd


iCommands: ~/rods/clients/icommands/bin

• icd, ichmod, icp, ils, imkdir, imv, ipwd, irm

• iget, iput, ireg, irepl, itrim, irsync, ilsresc, iphymv, irmtrash, ichksum, iinit, iexit

• iqdel, iqmod, iqstat, iquest, iexecmd, irule, iuserinfo, isysmeta, imeta, imiscsvrinfo, iadmin


iRODS Components

• Persistent state information catalog - iCAT
• Server middleware
• Clients
• Rule engine
  • Implements server-side workflows composed from micro-services
  • Rules control execution of micro-services


Data Grid

Using a Data Grid – in Abstract

• User asks for data from the data grid
• The data is found and returned
• Where & how details are hidden


Using a Data Grid - Details

[Diagram: a user, two iRODS Servers, and the Metadata Catalog (DB)]

• User asks for data
• Data request goes to iRODS Server
• Server looks up information in catalog
• Catalog tells which iRODS server has data
• 1st server asks 2nd for data
• The 2nd iRODS server applies rules
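The request flow on this slide can be sketched in a few lines of Python. This is only an illustration of the sequence of steps; the `Catalog` and `Server` classes, their method names, and the `audit` list are hypothetical and are not part of iRODS.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical metadata catalog: logical path -> name of the server holding the data."""
    locations: dict = field(default_factory=dict)

    def locate(self, logical_path):
        return self.locations[logical_path]


@dataclass
class Server:
    name: str
    catalog: Catalog
    peers: dict = field(default_factory=dict)    # server name -> Server
    storage: dict = field(default_factory=dict)  # logical path -> bytes
    rules: list = field(default_factory=list)    # callables run before data is returned

    def request(self, logical_path):
        # The server that receives the request looks up the location in the catalog.
        holder = self.catalog.locate(logical_path)
        if holder == self.name:
            return self.serve(logical_path)
        # The 1st server asks the 2nd server for the data.
        return self.peers[holder].serve(logical_path)

    def serve(self, logical_path):
        # The server that holds the data applies its rules, then returns the data.
        for rule in self.rules:
            rule(logical_path)
        return self.storage[logical_path]


path = "/tempZone/home/rods/foo1"
catalog = Catalog({path: "server2"})
audit = []  # stands in for whatever the rules record
server2 = Server("server2", catalog, storage={path: b"data"}, rules=[audit.append])
server1 = Server("server1", catalog, peers={"server2": server2})
result = server1.request(path)
```

The user only talks to `server1`; where the data lives and how it is fetched stay hidden behind `request`.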


Connecting to iRODS Collection
Class Exercise

• iinit - initiate connection using default parameters specified in the file ~/.irods/.irodsEnv
  • irodsHost 'localhost'
  • irodsPort 1247
  • irodsDefResource=demoResc
  • irodsHome=/tempZone/home/rods
  • irodsCwd=/tempZone/home/rods
  • irodsUserName 'rods'
  • irodsZone 'tempZone'
• Authentication done using the file ~/.irods/.irodsAuth
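The environment file shown above mixes two line styles (`key 'value'` and `key=value`). A minimal Python sketch of a reader for both styles follows. The function name `parse_irods_env` is hypothetical, and skipping blank lines and `#` comments is an assumption rather than documented iRODS behaviour.

```python
def parse_irods_env(text):
    """Parse lines such as  irodsHost 'localhost'  or  irodsHome=/tempZone/home/rods."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: blank lines and comments are ignored
            continue
        if "=" in line:
            key, value = line.split("=", 1)
        else:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # a key without a value carries no setting
            key, value = parts
        # Drop surrounding whitespace and optional quotes around the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


sample = """irodsHost 'localhost'
irodsPort 1247
irodsDefResource=demoResc
irodsHome=/tempZone/home/rods
irodsUserName 'rods'
irodsZone 'tempZone'
"""
settings = parse_irods_env(sample)
```

All values are returned as strings; a caller that needs the port as a number would convert `settings["irodsPort"]` itself.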


Disconnect From iRODS
Class Exercise

• $ iexit -h

• Exits iRODS session (cwd) and optionally removes the scrambled password file produced by iinit.

• Usage: iexit [-vh] [full]

• If 'full' is included the scrambled password is also removed.

• -v verbose

• -V very verbose

• -h this help


Data Grids

• Data virtualization
  • Provide the persistent, global identifiers needed to manage distributed data
  • Provide standard operations for interacting with heterogeneous storage systems
  • Map from storage protocols to preferred clients
• Trust virtualization
  • Manage authentication and authorization
  • Enable access controls on data, metadata, storage
• Federation
  • Controlled sharing of name spaces, files, and metadata between independent data grids
  • Data grid chaining / Central archives / Master-slave data grids / Peer-to-Peer data grids


Data Virtualization

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access controls

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Access constraints

Data Collection

Data Access Methods (C library, Unix, Web Browser)

Data is organized as a shared collection
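The mapping from logical names to physical storage can be illustrated with a small Python sketch. The class `LogicalNamespace`, the resource `tapeResc`, the host `archive.example.org`, and the rule that a physical path is the vault path followed by the logical path are all assumptions made for illustration; only `demo2Resc` and its vault path come from the class exercise.

```python
class LogicalNamespace:
    """Hypothetical catalog that maps logical names onto physical storage."""

    def __init__(self):
        self.resources = {}  # logical resource name -> (host, vault path)
        self.replicas = {}   # logical file path -> list of (resource name, physical path)

    def add_resource(self, name, host, vault):
        self.resources[name] = (host, vault)

    def register(self, logical_path, resource):
        host, vault = self.resources[resource]
        # Assumed layout: the logical path is appended to the vault path.
        physical = vault.rstrip("/") + logical_path
        self.replicas.setdefault(logical_path, []).append((resource, physical))
        return physical

    def resolve(self, logical_path):
        # A client asks by logical name and gets back every physical copy.
        return [(self.resources[name][0], physical)
                for name, physical in self.replicas[logical_path]]


grid = LogicalNamespace()
grid.add_resource("demo2Resc", "localhost", "/Applications/iRODS/Vault2")
grid.add_resource("tapeResc", "archive.example.org", "/tape/vault/")  # hypothetical resource
grid.register("/tempZone/home/rods/foo1", "demo2Resc")
grid.register("/tempZone/home/rods/foo1", "tapeResc")
copies = grid.resolve("/tempZone/home/rods/foo1")
```

Moving a file to new storage changes only the entries in `replicas`; the logical name that users and metadata refer to stays the same.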


iRODS File Name Space
Class Exercise

$ ils -l
/tempZone/home/rods:

$ imkdir nvo
$ imkdir tg
$ imkdir loopTest
$ ils -l

Do you see the new directories?


iRODS File Name Space
Class Exercise

$ ils -l
/tempZone/home/rods:
  C- /tempZone/home/rods/loopTest
  C- /tempZone/home/rods/nvo
  C- /tempZone/home/rods/tg


iRODS Storage Name Space
Class Exercise

Create Storage Resource
$ iadmin mkresc demo2Resc 'unix file system' archive localhost /Applications/iRODS/Vault2

List storage resources
$ ilsresc
demo2Resc
demoResc


iRODS Storage Name Space
Class Exercise

$ ilsresc -l
resource name: demo2Resc
resc id: 10010
zone: tempZone
type: unix file system
class: archive
location: localhost
vault: /Applications/iRODS/Vault2
free space:
info:
comment:
create time: 01185236177: 2007-07-23.17:16:17
modify time: 01185236177: 2007-07-23.17:16:17


iRODS User Name Space
Class Exercise

$ iuserinfo
name: rods
id: 10004
type: rodsadmin
zone: tempZone
dn:
info:
comment:
create time: 01185235539
modify time: 01185235539
Not a member of any group


Data Virtualization

Storage System

Storage Protocol

Access Interface

Standard Micro-services

Data Grid

Map from the actions requested by the access method to a standard set of micro-services. The standard micro-services are mapped to the operations supported by the storage system.

Standard Operations


Standard Operations

• The capabilities needed to interact with storage systems
  • Posix I/O
  • File manipulation
  • Metadata manipulation
  • Bulk operations
  • Parallel I/O
  • Remote procedures
  • Registration


iRODS File Manipulation

$ iput -h
Usage : iput [-fkKrvV] [-D dataType] [-N numThreads] [-n replNum]
             [-p physicalPath] [-R resource] [-X restartFile]
             localSrcFile|localSrcDir ... destDataObj|destColl
Usage : iput [-fkKvV] [-D dataType] [-N numThreads] [-n replNum]
             [-p physicalPath] [-R resource] [-X restartFile] localSrcFile

Store a file into iRODS. If the destination data-object or collection are not provided, the current irods directory and the input file name are used. The -X option specifies that the restart option is on and the restartFile input specifies a local file that contains the restart info. If the restartFile does not exist, it will be created and used for recording subsequent restart info. If it exists and is not empty, the restart info contained in this file will be used for restarting the operation. Note that the restart operation only works for uploading directories and the path input must be identical to the one that generated the restart file


iRODS File Manipulation

$ iput -h
Options are:
 -f  force - write data-object even if it exists already; overwrite it
 -k  checksum - calculate a checksum on the data
 -K  verify checksum - calculate and verify the checksum on the data
 -N  numThreads - the number of transfer threads to use. A value of 0 means no threading. By default (-N option not used) the server decides the number of threads to use.
 -R  resource - specifies the resource to store to. This can be specified via a rule set up by the administrator.
 -r  recursive - store the whole subdirectory
 -v  verbose
 -V  Very verbose
 -X  restartFile - specifies that the restart option is on and the local file that contains the restart info.
 -h  this help


Put a File into the iRODS Collection
Class Exercise

• $ iput foo1

• $ ils -l

• /tempZone/home/rods:

• rods 0 demoResc 4585 2007-09-20.16:18 & foo1

• C- /tempZone/home/rods/nvo

• C- /tempZone/home/rods/tg


Transcontinental Persistent Archive Prototype

• Distributed Data Management Concepts
  • Data virtualization
    • Storage system independence
  • Trust virtualization
    • Administration independence
  • Risk mitigation
    • Federation of multiple independent data grids
    • Operation independence


National Archives and Records Administration Transcontinental Persistent Archive Prototype

Federation of Seven Independent Data Grids

[Diagram: seven federated data grids, each with its own MCAT metadata catalog: U Md, SDSC, Georgia Tech, NARA I, NARA II, Rocket Center, U NC]

Extensible Environment, can federate with additional research and education sites. Each data grid uses different vendor products.


Data Management Challenges

• Authenticity
  • Manage descriptive metadata for each file
  • Manage access controls
  • Manage consistent updates to administrative metadata
• Integrity
  • Manage checksums
  • Replicate files
  • Synchronize replicas
  • Federate data grids
• Infrastructure independence
  • Manage collection properties
  • Manage interactions with storage systems
  • Manage distributed data


Digital Preservation

• Preservation is communication with the future
  • How do we migrate records onto new technology (information syntax, encoding format, storage infrastructure, access protocols)?
  • SRB - Storage Resource Broker data grid provides the interoperability mechanisms needed to manage multiple versions of technology
• Preservation manages communication from the past
  • What information do we need from the past to make assertions about preservation assessment criteria (authenticity, integrity, chain of custody)?
  • iRODS - integrated Rule-Oriented Data System


Data Grids

• SRB - Storage Resource Broker
  • Persistent naming of distributed data
  • Management of data stored in multiple types of storage systems
  • Organization of data as a shared collection with descriptive metadata, access controls, audit trails
• iRODS - integrated Rule-Oriented Data System
  • Rules control execution of remote micro-services
  • Manage persistent state information
  • Validate assertions about collection
  • Automate execution of management policies


Preservation Models

• Diplomatics (InterPARES)
  • Authenticity of records asserted by submitting institution
  • Records preserved forever
• Preservation lifecycle (NARA)
  • Arrangement / Hierarchical metadata - Record Group, Record series, Folder, Item, Object
  • Archival information packages (AIP)
• Continuum (NSDL)
  • Preservation within context of active records (active data grid)
• Digital library (DSpace)
  • Digital library standards for arrangement and description (METS, OAI-PMH)


InterPARES - Diplomatics

• Authenticity - maintain links to metadata for:
  • Date record is made
  • Date record is transmitted
  • Date record is received
  • Date record is set aside [i.e. filed]
  • Name of author (person or organization issuing the record)
  • Name of addressee (person or organization for whom the record is intended)
  • Name of writer (entity responsible for the articulation of the record’s content)
  • Name of originator (electronic address from which record is sent)
  • Name of recipient(s) (person or organization to whom the record is sent)
  • Name of creator (entity in whose archival fonds the record exists)
  • Name of action or matter (the activity for which the record is created)
  • Name of documentary form (e.g. E-mail, report, memo)
  • Identification of digital components
  • Identification of attachments (e.g. digital signature)
  • Archival bond (e.g. classification code)


InterPARES - Diplomatics

• Integrity - maintain links to metadata for
  • Name(s) of the handling office / officer
  • Name of office of primary responsibility for keeping the record
  • Annotations or comments
  • Actions carried out on the record
  • Technical modifications due to transformative migration
  • Validation


Preservation Rules

• Authenticity
  • Rules that quantify required descriptive metadata
  • Rules that verify descriptive metadata is linked to records
  • Rules that govern creation of AIPs
• Integrity
  • Rules that verify records have not been corrupted
  • Rules that manage replicas
  • Rules that recover from corruption instances
  • Rules that manage data distribution
• Chain of custody
  • Persistent identifiers for archivists, records, storage
  • Rules to verify application of access controls
  • Rules to track storage location of records


Integrity Challenges

• Data grids manage shared collections that are distributed across multiple storage systems and institutions

• Data grids are responsible for providing recovery mechanisms for all errors that occur in the distributed environment

• The number of observed problems is proportional to the size of the collections


Integrity Mechanisms
Class Exercise

• $ irepl -h

• Usage : irepl [-aBMrvV] [-n replNum] [-R destResource] [-S srcResource] [-X restartFile] dataObj|collection ...

• Replicate a file in iRODS to another storage resource.

• $ irepl -R demo2Resc foo1

• $ ils -l

• /tempZone/home/rods:

• rods 0 demoResc 4585 2007-08-30.14:33 & foo1

• rods 1 demo2Resc 4585 2007-09-18.17:36 & foo1


Integrity Mechanisms

• $ irsync -h
• Usage : irsync [-rahsvV] [-R resource] sourceFile|sourceDirectory [....] targetFile|targetDirectory
• Synchronize the data between a local copy (local file system) and the copy stored in iRODS or between two iRODS copies. The command can be in one of the three modes:
  • synchronization of data from the client's local file system to iRODS,
  • from iRODS to the local file system,
  • from one iRODS path to another iRODS path.
• The mode is determined by the way the sourceFile|sourceDirectory and targetFile|targetDirectory are specified.
  • Files and directories prepended with 'i:' are iRODS files and collections.
  • Local files and directories are specified without any prependage.


Integrity Mechanisms

• irsync -r foo1 i:foo2
  • synchronizes recursively the data from the local directory foo1 to the iRODS collection foo2.
• irsync -r i:foo1 foo2
  • synchronizes recursively the data from the iRODS collection foo1 to the local directory foo2.
• irsync -r i:foo1 i:foo2
  • synchronizes recursively the data from the iRODS collection foo1 to another iRODS collection foo2.
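How the 'i:' prefix selects one of the three modes can be written down as a short Python sketch. The function name `sync_mode` and the returned labels are hypothetical, and rejecting a local-to-local pair is an assumption, since the help text lists only three modes.

```python
def sync_mode(source, target):
    """Classify an irsync-style source/target pair by its 'i:' prefixes."""
    source_in_irods = source.startswith("i:")
    target_in_irods = target.startswith("i:")
    if source_in_irods and target_in_irods:
        return "iRODS to iRODS"
    if source_in_irods:
        return "iRODS to local"
    if target_in_irods:
        return "local to iRODS"
    # Assumption: two local paths match none of the three documented modes.
    raise ValueError("at least one path must carry the 'i:' prefix")


modes = [
    sync_mode("foo1", "i:foo2"),
    sync_mode("i:foo1", "foo2"),
    sync_mode("i:foo1", "i:foo2"),
]
```

The three calls correspond, in order, to the three example commands on the slide.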


Integrity Mechanisms

• $ ichksum -h
• Usage : ichksum [-harvV] [-K|f] [-n replNum] dataObj|collection
• Checksum one or more data-object or collection from iRODS space.
• Options are:
  • -f force checksum data-objects even if a checksum already exists
  • -a checksum all replica.
  • -K verify the checksum value in icat. If the checksum value does not exist, compute and register one.
  • -n replNum - the replica to checksum; if not specified checksum all replicas
  • -r recursive - checksum the whole subtree; the collection, all data-objects in the collection, and any subcollections and sub-data-objects in the collection.
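The behaviour described for -K (verify against the registered value, or compute and register one if none exists) can be sketched in Python. This is not the ichksum implementation: the function names are hypothetical, a plain dictionary stands in for the iCAT, and SHA-256 is chosen only for illustration because the slides do not say which checksum algorithm iRODS uses.

```python
import hashlib
import os
import tempfile


def file_checksum(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()  # assumption: algorithm picked for the sketch
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_or_register(catalog, logical_name, path):
    """Compare against the registered checksum; register one if none exists."""
    current = file_checksum(path)
    stored = catalog.get(logical_name)
    if stored is None:
        catalog[logical_name] = current
        return "registered"
    return "ok" if stored == current else "mismatch"


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "iquest.c")
    with open(path, "wb") as handle:
        handle.write(b"original contents")
    catalog = {}  # stands in for the checksum column of the catalog
    first = verify_or_register(catalog, "iquest.c", path)   # no checksum yet
    second = verify_or_register(catalog, "iquest.c", path)  # file unchanged
    with open(path, "ab") as handle:
        handle.write(b" edited outside the data grid")
    third = verify_or_register(catalog, "iquest.c", path)   # file modified
```

The last call models a file edited directly in the vault: the registered value no longer matches, which is the kind of corruption an integrity rule is meant to detect.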


Class Exercise - HELP.looptest

• Make two test collections, and load files from your system

• imkdir loopTest

• imkdir loopTest2

• icd loopTest

• iput ../src/ipwd.c

• iput ../src/iquest.c

• iput ../src/ils.c

• ils -l


iRULE icommand

• $ irule -h
• Usage : irule [--test] [-F inputFile] [ruleBody inputParam outParamDesc]
• Submit a user defined rule to be executed by an irods server. The command requires 3 input:
  • 1) ruleBody - This is the body of the rule to be executed.
  • 2) inputParam - The input parameters. The input values for the rule is specified here. If there is no input, a string containing "null" must be specified.
  • 3) outParamDesc - Description for the set of output parameters to be returned. If there is no output, a string containing "null" must be specified.
• The input can be specified through the command line or in a file using the -F option. The inputFile should contain 3 lines, one for each input. An example of the input is given in the file:
  • clients/icommands/test/ruleInp1
• Options are:
  • --test enable test mode so that the microservices are not executed, instead a loopback is performed
  • -F inputFile - read the file for the input
  • -v verbose
  • -h this help


Class Exercise - HELP.looptest

• /* LISTING AND CHECKSUM */

• irule -F ../test/listColl.ir

• ichksum -r .

• irule -F ../test/showicatchksumColl.ir

• /** use the following to change a file under iRODS

• iquest "select DATA_PATH where DATA_NAME = 'iquest.c'"

• vi

• **/ modify file

• irule -F ../test/verifychksumColl.ir

• irule -F ../test/forcechksumColl.ir


Types of Risk

• Media failure
  • Replicate data onto multiple media
• Vendor specific systemic errors
  • Replicate data onto multiple vendor products
• Operational error
  • Replicate data onto a second administrative domain
• Natural disaster
  • Replicate data to a geographically remote site
• Malicious user
  • Replicate data to a deep archive


How Many Replicas

• Three sites minimize risk
  • Primary site
    • Supports interactive user access to data
  • Secondary site
    • Supports interactive user access when first site is down
    • Provides 2nd media copy, located at a remote site, uses different vendor product, independent administrative procedures
  • Deep archive
    • Provides 3rd media copy, staging environment for data ingestion, no user access


Data Reliability

• Manage checksums
  • Verify integrity
  • Rule to verify checksums
• Synchronize replicas
  • Verify consistency between metadata and records in vault
  • Rule to verify presence of required metadata
• Federate data grids
  • Synchronize metadata catalogs


Rule-based Data Management

• Map from management policies to rules controlling execution of remote micro-services

• Manage persistent state information for results of each micro-service execution

• Support an additional three logical name spaces
  • Rules
  • Micro-services
  • Persistent state information

• Constitutes representation information for preservation environments


Micro-Services

• Challenge is that storage systems do not provide required sophisticated operations
  • Have "minimal" set of standard operations that are performed at the storage system
  • Have actions required by clients such as replication, metadata extraction
  • Create standard micro-services that aggregate storage operations into modules, which can in turn be combined to implement desired client actions.
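The layering described here (minimal storage operations, micro-services built from them, client actions built from micro-services) can be illustrated with a short Python sketch. `MemoryStore`, `msi_copy`, `msi_checksum`, and `replicate` are hypothetical names and do not correspond to actual iRODS micro-services; SHA-256 is an arbitrary choice for the sketch.

```python
import hashlib


class MemoryStore:
    """Hypothetical storage system that offers only minimal operations."""

    def __init__(self):
        self.files = {}

    def read(self, path):
        return self.files[path]

    def write(self, path, data):
        self.files[path] = data


def msi_copy(source, target, path):
    # Micro-service built from two storage operations: read, then write.
    target.write(path, source.read(path))


def msi_checksum(store, path):
    # Micro-service built from one storage operation plus a digest.
    return hashlib.sha256(store.read(path)).hexdigest()


def replicate(source, target, path):
    # Client action composed from micro-services: copy, then verify the copy.
    msi_copy(source, target, path)
    return msi_checksum(source, path) == msi_checksum(target, path)


primary, archive = MemoryStore(), MemoryStore()
primary.write("foo1", b"record contents")
replicated = replicate(primary, archive, "foo1")
```

Because `replicate` touches storage only through `read` and `write`, the same client action works against any storage system that supplies those two operations.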


Micro-service Classes

• Test micro-services
• System micro-services
• Workflow micro-services
• System micro-services
• User micro-services called by client
• iCAT micro-services
• User micro-services invoked by "irule"
• Image manipulation micro-services


Example Micro-Services
http://irods.sdsc.edu/index.php/List_of_Micro-Services

• Workflow Services:
  • nop, null - no action
  • cut - not to retry any other applicable rules for this action
  • succeed - succeed immediately
  • fail - fail immediately - recovery and retries are possible
  • msiGoodFailure - useful when you want to fail but no recovery initiated.
  • msiNullAction - same as nop
  • whileExec - while loop over result set
  • forExec - for loop over result set


Example Micro-Services
http://irods.sdsc.edu/index.php/List_of_Micro-Services

• System Micro Services - Can only be called by the server process.
  • msiSetDefaultResc - set the default resource
  • msiSetNoDirectRescInp - sets a list of resources that cannot be used by a normal user directly.
  • msiSetRescSortScheme - set the scheme for selecting the best resource to use
  • msiSetMultiReplPerResc - sets the number of copies per resource to unlimited
  • msiSetDataObjPreferredResc - if the data has multiple copies, specify the preferred copy to use
  • msiSetDataObjAvoidResc - specify the copy to avoid
  • msiSortDataObj - Sort the replica randomly when choosing which copy to use
  • msiSetNumThreads - specify the parameters for determining the number of threads to use for data transfer.
  • msiSysChksumDataObj - checksum a data object.
  • msiSysReplDataObj - replicate a data object.


Example Micro-Services
http://irods.sdsc.edu/index.php/List_of_Micro-Services

• msiStageDataObj - stage the data object to the specified resource before operation.
• msiNoChkFilePathPerm - Do not check file path permission when registering
• msiNoTrashCan - Set the policy to no trash can.
• msiSetPublicUserOpr - Sets a list of operations that can be performed by the user "public".
• msiSetGraftPathScheme - Set the scheme for composing the physical path in the vault to GRAFT_PATH.
• msiSetRandomScheme - set the scheme for composing the physical path in the vault to RANDOM.
• msiCheckHostAccessControl - Set the access control policy.
• msiDeleteDisallowed - Set the policy for determining certain data cannot be deleted.
• msiSetResource() - sets the resource from default
• msiSetResourceList - get a resource based on conditions
• msiSetDataTypeFromExt() - get data type based on file name extension


Rules

• Roles
  • Internal rules used to maintain consistency between operations and persistent state information
  • Administrator controlled rules that are automatically invoked
  • User executable rules


Rules

• Types of rules
  • Atomic rules - executed on each operation invoked by a client
  • Deferred rules - executed at a future time
  • Periodic rules - executed to validate assessment criteria and enforce desired properties


iRODS Rule Syntax

• Event | Condition | Action-set | Recovery-set
  • Event - triggered by operation or queued rule
  • Condition - composed by tests on any attributes in the persistent state information
  • Action-set - composed from both micro-services and rules
  • Recovery-set - used to ensure transaction semantics and consistent state information
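The role of the recovery set can be illustrated with a small Python sketch in which every action is paired with a recovery step. This is a simplified model and not the iRODS rule engine: `run_rule` and the helper functions are hypothetical, and undoing only the already completed actions, in reverse order, is an assumption about how transaction semantics could be achieved.

```python
def run_rule(condition, actions, state):
    """Run (action, recovery) pairs; on failure, undo the completed actions."""
    if not condition(state):
        return False  # the rule does not apply to this event
    completed = []
    try:
        for action, recovery in actions:
            action(state)
            completed.append(recovery)
    except Exception:
        # Restore a consistent state by undoing completed actions in reverse.
        for recovery in reversed(completed):
            recovery(state)
        return False
    return True


def create_user(state):
    state["users"].append("newUser")


def remove_user(state):
    state["users"].remove("newUser")


def create_collections(state):
    raise RuntimeError("storage unavailable")  # simulated failure


def no_recovery(state):
    pass


state = {"users": [], "is_admin": True}
succeeded = run_rule(
    lambda current: current["is_admin"],
    [(create_user, remove_user), (create_collections, no_recovery)],
    state,
)
```

The second action fails, so the user created by the first action is removed again and the state ends up as it started.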


iRODS Rule Sample

$ irule -F showcore.ir
5 core.acCreateUser
    msiCreateUser [msiRollback]
    acCreateDefaultCollections [msiRollback]
    msiCommit
7 core.acCreateDefaultCollections
    acCreateUserZoneCollections
8 core.acCreateUserZoneCollections
    acCreateCollByAdmin(/$rodsZoneProxy/home,$otherUserName)
    acCreateCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName)
9 core.acCreateCollByAdmin(*parColl,*childColl)
    msiCreateCollByAdmin(*parColl,*childColl)


iRODS Rich Web Client

• Use web browser to access a remote iRODS shared data collection
  • Host/IP rt.sdsc.edu
  • Port 1247
  • Username demoUser
  • Password demo


Web Client Collection Upload

• Click the 'Sign-on' button; you should be redirected to your home collection '/tempZone/home/demoUser'
• Click on the 'new' button, and try to create a new collection 'upload'
• Double click on the newly created collection 'upload'
  • Click on the 'upload' button, and pick 'Single file'
  • Upload an image file with that dialog.
  • The file should appear after the upload completes.
• Click the 'upload -> files and collections' button; the applet should load, which takes a while (4-7 sec)
  • Drag and drop files/directories to the box inside the applet.
  • Click 'upload' at the bottom of the applet, and uploading will start.
  • Close the applet layer.
  • The newly uploaded files/directories will appear after refreshing the current directory, either by clicking the refresh button at the bottom of the grid or by refreshing the web page.


Web Client Image Manipulation

• Double click on an image file (jpg, png, or bmp) to bring up the file viewer dialog.
  • Click on the 'Open' button to open the image file in a new browser window
  • Close the new window and file viewer dialog
• Click on 'More -> metadata' to bring up the metadata dialog
  • Add/modify/delete some random metadata, similar to spreadsheet operations, and click 'save'. Note that the metadata name must be unique.
  • Close the metadata dialog
• Click on the 'delete' button to delete the file.
• The quick search box is implemented. Type a partial file name in the quick search box, located at the top right corner, and press enter. The partial name is case sensitive. For instance, if there is a file named 'bar_foo_bar.txt', a search for the string 'foo' will return the file. The result files are also clickable.
• Click on the 'Sign Out' link at the top right corner of the screen.


iCommands

• iinit                 initialize access
• imkdir directory      make directory
• ils                   list files
• ilsresc               list storage resources
• iput directory file   put file into iRODS
• iget file             get file from iRODS
• imeta -h              list metadata options


Metadata Manipulation: Class Exercise

• $ imeta -h
• Usage: imeta [-vVh] [command]
• Commands are:
• add -d|C|R|u Name AttName AttValue [AttUnits] (Add new AVU triplet)
• rm -d|C|R|u Name AttName AttValue [AttUnits] (Remove AVU)
• rmw -d|C|R|u Name AttName AttValue [AttUnits] (Remove AVU, use Wildcards)
• ls -d|C|R|u Name [AttName] (List existing AVUs for item Name)
• lsw -d|C|R|u Name [AttName] (List existing AVUs, use Wildcards)
• qu -d|C|R|u AttName Op AttVal (Query objects with matching AVUs)
• cp -d|C|R|u -d|C|R|u Name1 Name2 (Copy AVUs from item Name1 to Name2)

• Metadata attribute-value-units triplets (AVUs) consist of an Attribute-Name, Attribute-Value, and an optional Attribute-Units. They can be added via the 'add' command and then queried to find matching objects.

• For each command, -d, -C, -R or -u is used to specify which type of object to work with: dataobjs (irods files), collections, resources, or users. (Within imeta -c and -r can be used, but -C and -R are the iRODS standard options for collections and resources.)
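To make the AVU model concrete, here is a minimal in-memory sketch in Python of the add, ls, rm and qu operations listed in the help text. The class `AvuStore` and its methods are hypothetical stand-ins, not an iRODS API, and the query supports only the '=' operator.

```python
from collections import defaultdict


class AvuStore:
    """Toy in-memory stand-in for the metadata catalog (not an iRODS API)."""

    def __init__(self):
        # (object_type, object_name) -> list of (attribute, value, units)
        self._avus = defaultdict(list)

    def add(self, obj_type, name, attribute, value, units=""):
        """Attach an AVU triplet to an object; exact duplicates are ignored."""
        triplet = (attribute, value, units)
        if triplet not in self._avus[(obj_type, name)]:
            self._avus[(obj_type, name)].append(triplet)

    def ls(self, obj_type, name, attribute=None):
        """List the AVUs of one object, optionally restricted to one attribute."""
        return [t for t in self._avus.get((obj_type, name), [])
                if attribute is None or t[0] == attribute]

    def rm(self, obj_type, name, attribute, value, units=""):
        """Remove one exact AVU triplet from an object."""
        key = (obj_type, name)
        self._avus[key] = [t for t in self._avus.get(key, [])
                           if t != (attribute, value, units)]

    def qu(self, obj_type, attribute, op, value):
        """Find objects carrying a matching AVU (assumption: '=' only)."""
        if op != "=":
            raise ValueError("this sketch only supports the '=' operator")
        return sorted(n for (t, n), avus in self._avus.items()
                      if t == obj_type
                      and any(a == attribute and v == value for a, v, _ in avus))


store = AvuStore()
store.add("d", "foo1", "Genealogy", "Moore")
store.add("d", "foo1", "number of persons", "170,682")
print(store.ls("d", "foo1"))
print(store.qu("d", "Genealogy", "=", "Moore"))  # ['foo1']
```

The object type is part of the key, mirroring the way -d, -C, -R and -u select data objects, collections, resources or users.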


Metadata Manipulation: Class Exercise

• $ imeta add -d foo1 Genealogy Moore

• $ imeta add -d foo1 "number of persons" 170,682

• $ imeta ls -d foo1

AVUs defined for dataObj foo1:

attribute: Genealogy

value: Moore

units:

----

attribute: number of persons

value: 170,682

units:


Trash: Class Exercise

• irm transfers file to the trash
• Trash collection is located at
  • /tempZone/trash
• Your directory structure is replicated as files are removed
  • irm foo1
  • /tempZone/trash/rods/foo1

• irmtrash removes files from trash
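The trash behavior can be sketched with a dictionary standing in for the logical name space. This is a toy model only: the Python functions `trash_path`, `irm` and `irmtrash` are hypothetical helpers named after the commands, the trash layout follows the path shown on the slide, and the home collection /tempZone/home/rods is assumed from the paths used elsewhere in the exercises.

```python
import posixpath


def trash_path(zone, user, relative_path):
    """Build the trash location for a removed file, following the slide's layout."""
    return posixpath.join("/", zone, "trash", user, relative_path)


def irm(files, zone, user, relative_path):
    """Move an entry from the user's home collection into the trash collection."""
    source = posixpath.join("/", zone, "home", user, relative_path)
    files[trash_path(zone, user, relative_path)] = files.pop(source)


def irmtrash(files, zone, user):
    """Permanently drop everything under the user's trash collection."""
    prefix = posixpath.join("/", zone, "trash", user) + "/"
    for path in [p for p in files if p.startswith(prefix)]:
        del files[path]


# Toy name space: logical path -> file contents (contents are made up).
files = {"/tempZone/home/rods/foo1": b"example"}
irm(files, "tempZone", "rods", "foo1")
print(sorted(files))  # ['/tempZone/trash/rods/foo1']
irmtrash(files, "tempZone", "rods")
print(files)  # {}
```

The point of the two-step design is that irm is recoverable until irmtrash is run.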


User Level Rules

• irule -F rulename      Execute your rule

• Rules
  • showCore.ir            list current rule base
  • listColl.ir            list checksums
  • verifychksumColl.ir    verify checksums
  • forcechksumColl.ir     update checksums
  • replColl.ir            replicate collection


Checksum Verification Example

$ more ../test/listColl.ir
myTestRule | | acGetIcatResults(*Action,*Condition,*B)##forEachExec(*B,msiPrintKeyValPair(stdout,*B) ##writeLine(stdout,*K),nop) | nop ## nop
*Action=list%*Condition= COLL_NAME = '/tempZone/home/rods/loopTest'%*K=--------FILE-------------
*Action%*Condition%ruleExecOut
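The rule above lists the catalog entries for one collection; the companion verifychksumColl.ir rule verifies checksums. The idea behind verification can be sketched in Python as recomputing a digest for every file and comparing it with the recorded value. This is a hedged illustration, not the micro-service code: the helper names are hypothetical, the file contents are made up, and SHA-256 is used only for illustration because the slides do not say which checksum algorithm iRODS applies.

```python
import hashlib


def digest(data: bytes) -> str:
    """Checksum used by this sketch (assumption: SHA-256, hex encoded)."""
    return hashlib.sha256(data).hexdigest()


def verify_collection(contents, recorded):
    """Return the names whose current checksum differs from the recorded one."""
    return sorted(name for name, data in contents.items()
                  if recorded.get(name) != digest(data))


# Made-up collection contents: file name -> bytes.
contents = {"a.txt": b"alpha", "b.txt": b"beta"}
recorded = {name: digest(data) for name, data in contents.items()}

print(verify_collection(contents, recorded))  # []
contents["b.txt"] = b"bet4"                   # simulate silent corruption
print(verify_collection(contents, recorded))  # ['b.txt']
```

A file with no recorded checksum is also reported, since there is nothing to validate it against.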


Core.irb File: Class Exercise

$ irule -F showCore.ir
0 core.acPostProcForPut
    IF ($objPath like /tempZone/home/rods/nvo/*) msiSysReplDataObj(nvoReplResc,null)
1 core.acPostProcForPut
    IF ($objPath like /tempZone/home/rods/tg/*) delayExec(<PLUSET>1m</PLUSET>,msiSysReplDataObj(tgReplResc,null),nop)
2 core.acPostProcForPut
    IF ($objPath like *.mdf) msiLoadMetadataFromFile [msiRollback]
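The way these three acPostProcForPut entries select an action from the object path can be sketched in Python. Two assumptions are made here and are not stated on the slide: the entries are tried in the order listed and the first matching condition wins, and `like` behaves like a shell-style wildcard match. The action strings are plain descriptions, not micro-service calls.

```python
from fnmatch import fnmatchcase

# (condition pattern, description of the action), in the order listed above.
RULES = [
    ("/tempZone/home/rods/nvo/*", "replicate to nvoReplResc immediately"),
    ("/tempZone/home/rods/tg/*", "replicate to tgReplResc after a 1m delay"),
    ("*.mdf", "load metadata from the file"),
]


def select_action(obj_path, rules=RULES):
    """Return the action of the first rule whose pattern matches, else None.

    Assumption: first match wins; `like` is modelled with glob matching.
    """
    for pattern, action in rules:
        if fnmatchcase(obj_path, pattern):
            return action
    return None


print(select_action("/tempZone/home/rods/nvo/icd.c"))
print(select_action("/tempZone/home/rods/tg/icd.c"))
print(select_action("/tempZone/home/rods/notes.mdf"))
print(select_action("/tempZone/home/rods/other.txt"))  # None
```

Keeping the conditions in an ordered list is what lets one policy hook behave differently per collection.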


Test Replication: Class Exercise

• more irodsdemo.txt
iadmin mkresc demo3Resc 'unix file system' archive localhost /Applications/iRODS/Vault3
iadmin mkresc nvoReplResc 'unix file system' archive localhost /Applications/iRODS/VaultNVO
iadmin mkresc tgReplResc 'unix file system' archive localhost /Applications/iRODS/VaultTGp

../../../server/bin/stop.pl

../../../server/bin/start.pl


Test Replication: Class Exercise

• imkdir nvo
• imkdir tg
• ils -l nvo
• iput -R demoResc ../src/icd.c nvo
• ils -l nvo
• How is this different?


Test Replication: Class Exercise

• ils -l tg
• iput -R demoResc ../src/icd.c tg
• ils -l tg
• How is this different?
• iqstat -l
  • Check that the second copy is made
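The difference between the two exercises is that the tg rule wraps the replication in delayExec with a one-minute delay, so the second copy appears only after the queued request has run. A toy delay queue in Python illustrates the idea. The class `DelayQueue` is hypothetical and says nothing about how iRODS implements its queue; times are plain seconds and the task is just a description string.

```python
import heapq


class DelayQueue:
    """Toy queue of tasks that become runnable at a given time (in seconds)."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal times keep insertion order

    def schedule(self, run_at, task):
        heapq.heappush(self._heap, (run_at, self._counter, task))
        self._counter += 1

    def pending(self):
        """Tasks still waiting, earliest first (what a queue listing would show)."""
        return [task for _, _, task in sorted(self._heap)]

    def run_due(self, now):
        """Pop and return every task whose time has arrived."""
        done = []
        while self._heap and self._heap[0][0] <= now:
            done.append(heapq.heappop(self._heap)[2])
        return done


queue = DelayQueue()
queue.schedule(60, "replicate tg/icd.c to tgReplResc")  # put at t=0, delay 1m
print(queue.pending())    # the request is visible while it waits
print(queue.run_due(30))  # [] : too early, still one copy
print(queue.run_due(60))  # the replication runs, second copy is made
```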


Preservation Projects

• 1998: USPTO - Patent Digital Library
• 1999-2008: NARA - Transcontinental Persistent Archive Prototype
• 2000: NHPRC InterPARES I & II - International Preservation of Authentic Records in Electronic Systems
• 2002-2007: NSF NSDL - Web crawl preservation
• 2004: NHPRC - Persistent Archive Testbed
• 2004: UCSD Libraries - Image collection
• 2004: NARA - DSpace/SRB integration
• 2005: LC NDIIPP - CDL Digital Preservation Repository
• 2005: NSF/LC Digital Archiving project - UCSDtv “Conversations with History”
• 2006: IMLS - UCHRI data grid for humanities
• 2007: NSF Software Development Cyberinfrastructure


Digital Preservation

• Preservation manages communication from the past
• What information do we need from the past to make assertions about preservation assessment criteria (authenticity, integrity, chain of custody)?

• RLG/NARA Trusted Repository Audit and Certification Criteria


Theory of Data Management

• Characterization
  • Persistent name spaces
  • Operations that are performed upon the persistent name spaces
  • Changes to the persistent state information associated with each persistent name space that occur for each operation
  • Transformations that are made to the records on each operation
• Completeness
  • Set of operations is complete, enabling the decomposition of every management process onto the operation set.
  • Management policies are complete, enabling the validation of all assessment criteria.
  • Persistent state information is complete, enabling the validation of authenticity and integrity.
• Assertion
  • If the operations are reversible, then a future management environment can recreate a record in its original form, maintain authenticity and integrity, support access, and display the record.
  • Such a system would allow records to be migrated between independent implementations of management environments, while maintaining authenticity and integrity.
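The assertion about reversible operations can be illustrated with a small Python sketch: every transformation applied to a record is logged, and a later environment replays the inverses in reverse order to recreate the original. The two operations chosen here (zlib compression and base64 encoding) and the helper names are assumptions made for illustration; they stand in for whatever transformations a real environment applies.

```python
import base64
import hashlib
import zlib

# Each named operation is a (forward, inverse) pair of byte transformations.
OPERATIONS = {
    "compress": (zlib.compress, zlib.decompress),
    "base64": (base64.b64encode, base64.b64decode),
}


def apply_operations(record, names):
    """Apply the named operations in order and log each one (persistent state)."""
    log = []
    for name in names:
        forward, _ = OPERATIONS[name]
        record = forward(record)
        log.append(name)
    return record, log


def recreate_original(record, log):
    """Undo the logged operations in reverse order."""
    for name in reversed(log):
        _, inverse = OPERATIONS[name]
        record = inverse(record)
    return record


original = b"record contents"  # made-up record
migrated, log = apply_operations(original, ["compress", "base64"])
restored = recreate_original(migrated, log)

# Integrity check: the restored record has the same digest as the original.
print(hashlib.sha256(restored).hexdigest() == hashlib.sha256(original).hexdigest())
```

The log plays the role of the persistent state information: without a complete record of the operations, the inverse sequence could not be reconstructed.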


Federation Between Data Grids

[Figure: two federated data grids, one managing Data Collection A and the other Data Collection B, both reached through Data Access Methods (Web Browser, DSpace, OAI-PMH). Each data grid maintains:]

• Logical resource name space
• Logical user name space
• Logical file name space
• Logical rule name space
• Logical micro-service name
• Logical persistent state


Rule-based Federation

• Exchange files between data grids
• Associate management policies with the exchanged files
• Associate micro-services with the exchanged files
• Implication is that one can specify redaction requirements on a file that is deposited in another data grid


iRODS Development

• NSF - SDCI grant “Adaptive Middleware for Community Shared Collections”
  • iRODS development, SRB maintenance
• NARA - Transcontinental Persistent Archive Prototype
  • Trusted repository assessment criteria
• NSF - Ocean Research Interactive Observatory Network (ORION)
  • Real-time sensor data stream management
• NSF - Temporal Dynamics of Learning Center data grid
  • Management of IRB approval


iRODS Development Status

• Current release is version 0.9.2
  • June 2007
• Production release will be version 1.0
  • Fall quarter 2007
• International collaborations
  • SHAMAN - University of Liverpool
    • Sustaining Heritage Access through Multivalent ArchiviNg
  • UK e-Science data grid
  • IN2P3 in Lyon, France
  • DSpace policy management


Planned Development

• GSI support
• Time-limited sessions via a one-way hash authentication
• Python Client library
• GUI Browser (AJAX in development)
• Driver for HPSS (in development)
• Driver for SAM-QFS
• Porting to additional versions of Unix/Linux
• Porting to Windows
• Support for MySQL as the database
• API support packages based on existing MountColl driver
• MCAT to ICAT migration tools
• Extensible Metadata including Databases Access Interface
• Zones/Federation
• Auditing - mechanisms to record and track iRODS metadata changes