© 2006 Open Grid Forum
Preservation Environment Research Group
Rule-based preservation
OGF IPR Policies Apply
• “I acknowledge that participation in this meeting is subject to the OGF Intellectual Property Policy.”
• Intellectual Property Notices Note Well: All statements related to the activities of the OGF and addressed to the OGF are subject to all provisions of Appendix B of GFD-C.1, which grants to the OGF and its participants certain licenses and rights in such statements. Such statements include verbal statements in OGF meetings, as well as written and electronic communications made at any time or place, which are addressed to:
  • the OGF plenary session,
  • any OGF working group or portion thereof,
  • the OGF Board of Directors, the GFSG, or any member thereof on behalf of the OGF,
  • the ADCOM, or any member thereof on behalf of the ADCOM,
  • any OGF mailing list, including any group list, or any other list functioning under OGF auspices,
  • the OGF Editor or the document authoring and review process
• Statements made outside of an OGF meeting, mailing list or other function, that are clearly not intended to be input to an OGF activity, group or function, are not subject to these provisions.
• Excerpt from Appendix B of GFD-C.1: “Where the OGF knows of rights, or claimed rights, the OGF secretariat shall attempt to obtain from the claimant of such rights, a written assurance that upon approval by the GFSG of the relevant OGF document(s), any party will be able to obtain the right to implement, use and distribute the technology or works when implementing, using or distributing technology based upon the specific specification(s) under openly specified, reasonable, non-discriminatory terms. The working group or research group proposing the use of the technology with respect to which the proprietary rights are claimed may assist the OGF secretariat in this effort. The results of this procedure shall not affect advancement of document, except that the GFSG may defer approval where a delay may facilitate the obtaining of such assurances. The results will, however, be recorded by the OGF Secretariat, and made available. The GFSG may also direct that a summary of the results be included in any GFD published containing the specification.”
• OGF Intellectual Property Policies are adapted from the IETF Intellectual Property Policies that support the Internet Standards Process.
OGF20 Preservation Environments Research Group
• Organizers: Reagan Moore, Bruce Barkstrom
• Goals:
  • Present archives based on data grid technology
    • INCIPIT virtual vellum
  • Analyze capabilities required by a preservation environment
    • Define rule-based preservation environment - iRODS
    • NARA Electronic Records Archive capability requirements
    • RLG/NARA assessment criteria for a Trusted Digital Repository
    • Barkstrom GGF paper - based on NASA Langley preservation model
  • Analyze capabilities that can be based on grid technology
    • iRODS rule-oriented data system
• Participants:
  • 19 contributors to data grid federation for GIN
  • MIT - PLEDGE project on preservation policies
  • SDSC - NARA research prototype persistent archive
  • U Md - Producer Archive Workflow Network
  • EU CASPAR, PLANETS; UK Digital Curation Centre
Virtual Vellum
• Preserve shared collection of medieval manuscripts
  • Preserve provenance of manuscripts (authenticity)
  • Preserve display services (access)
  • Preserve arrangement (respect des fonds)
  • Preserve bits (integrity)
• http://www.dcs.shef.ac.uk/~mikem/virtualvellum
  • Based on SRB data grid and Virtual Vellum services
  • P F Ainsworth
Preservation Environment Requirements
• Bruce Barkstrom paper
  • Described capabilities needed in a preservation environment
• ERA capabilities list
  • http://www.crl.edu/content.asp?l1=13&l2=58&l3=160
• RLG/NARA trusted digital repository assessment criteria
  • http://www.dlib.org.ar/dlib/july06/ross/07ross.html
• Can we express these capabilities and assessment criteria as rules applied by the data grid?
ERA Capabilities
• List of 854 required capabilities:
  • Management of disposition agreements describing record retention and disposal actions
  • Accession, the formal acceptance of records into the data management system
  • Arrangement, the organization of the records to preserve a required structure (implemented as a collection/sub-collection hierarchy)
  • Description, the management of descriptive metadata as well as text indexing
  • Preservation, the generation of Archival Information Packages
  • Access, the generation of Dissemination Information Packages
  • Subscription, the specification of services that a user picks for execution
  • Notification, the delivery of notices on service execution results
  • Queuing of large scale tasks through interaction with workflow systems
  • System performance and failure reports. Of particular interest is the identification of all failures within the data management system and the recovery procedures that were invoked.
  • Transformative migration, the ability to convert specified data formats to new standards. In this case, each new encoding format is managed as a version of the original record.
  • Display transformation, the ability to reformat a file for presentation.
  • Automated client specification, the ability to pick the appropriate client for each user.
RLG/NARA TDR Assessment Criteria
• The assessment criteria can be mapped to management policies.
• The management policies can be mapped to a set of rules whose execution can be automated.
• The rules require definition of input parameters that define the assertion being implemented.
• The execution of the rules generates state information that can be evaluated to verify the assertion result.
• The types of rules that are needed include:
  • Specification of assertions (setting rule parameters - flags and descriptive metadata)
  • Deferred consistency constraints that may be applied at any time
  • Periodic rules that execute defined procedures
  • Atomic rules applied on each operation (access controls, audit trails)
• The rules determine the metadata attributes that need to be managed
  • Set of 174 rules
Digital Preservation
• Preservation is communication with the future
  • How do we migrate records onto new technology (information syntax, encoding format, storage infrastructure, access protocols)?
  • SRB - Storage Resource Broker data grid provides the interoperability mechanisms needed to manage multiple versions of technology
• Preservation manages communication from the past
  • What information do we need from the past to make assertions about preservation assessment criteria (authenticity, integrity, chain of custody)?
  • iRODS - integrated Rule-Oriented Data System
Socialization of Data Management
[Figure: layers of data management in iRODS - integrated Rule-Oriented Data System]
Data Management Environment: Conserved Properties | Control Mechanisms | Remote Operations
Management Functions: Assessment Criteria | Management Policies | Capabilities
  (Data grid – Management virtualization)
Data Management Infrastructure: Persistent State | Rules | Micro-services
  (Data grid – Data and trust virtualization)
Physical Infrastructure: Database | Rule Engine | Storage System
Preservation Management Policies
• Authenticity
  • Validate assertions made at time of data ingestion
  • Validate existence of the descriptive (provenance) metadata
  • Validate retention policy is consistent with submission agreement
• Integrity
  • Maintain information about the management of the data
  • Assertions made by the archivist
    • Access controls, audit trails, checksums, replication, synchronization, federation
• Infrastructure independence
  • Manage properties of records independently of choice of storage system
• Scalability
  • Manage large collections (billions of records, petabytes of data, thousands of attributes)
  • Aggregations across name spaces
iRODS
• Separate definition of management policies (rules) from definition of remote operations (micro-services)
• Control execution of all micro-services through application of rules
• Manage persistent state information for the results
• Query the persistent state information to validate assertions on preservation properties
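The four points on this slide can be sketched in a few lines of Python. This is an illustration only and not iRODS code: `MICRO_SERVICES`, `RULES`, `apply_rule`, `validate_replication`, and the layout of the `state` dictionary are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

# Hypothetical micro-services: small operations that know nothing about policy.
def compute_checksum(record: dict) -> str:
    return format(sum(record["data"]) % 65536, "04x")

def count_replicas(record: dict) -> int:
    return len(record["replicas"])

MICRO_SERVICES: Dict[str, Callable[[dict], object]] = {
    "checksum": compute_checksum,
    "replica_count": count_replicas,
}

# Hypothetical rules: a policy name mapped to the micro-services it invokes.
RULES: Dict[str, List[str]] = {
    "integrity_policy": ["checksum", "replica_count"],
}

def apply_rule(rule: str, record: dict, state: dict) -> None:
    """Run each micro-service named by the rule and keep its result as state."""
    for name in RULES[rule]:
        state[(record["name"], name)] = MICRO_SERVICES[name](record)

def validate_replication(state: dict, record_name: str, minimum: int) -> bool:
    """Query the stored state to check an assertion about a preservation property."""
    return state.get((record_name, "replica_count"), 0) >= minimum

state: dict = {}
record = {"name": "foo1", "data": b"abc", "replicas": ["demoResc", "nvoReplResc"]}
apply_rule("integrity_policy", record, state)
```

The point of the split is that a policy can change (edit `RULES`) without touching the operations, and an assertion is checked by querying stored state rather than by re-running the operations.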
iRODS - integrated Rule-Oriented Data System
[Figure: iRODS architecture diagram. Components shown: Client Interface, Admin Interface, Resources, Metadata Modifier Module, Config Modifier Module, Rule Modifier Module, Consistency Check Modules, Confs, Rule Base, Metadata Persistent Repository, Rule Engine, Current State, Rule Invoker, Service Manager, Micro Service Modules (Resource-based Services), Micro Service Modules (Metadata-based Services)]
Managing Preservation Policies
• Require at least six name spaces for managing identity
  • Logical storage resource name space
  • Logical user name space
  • Logical file name space
  • Logical rule name space
  • Logical micro-service name space
  • Logical persistent state name space
• Require ability to federate name spaces
  • Cross-register identity of object from each of the name spaces
• Require multiple levels of aggregation for each name space
  • Typically three levels of aggregation
• Trust virtualization
  • Ownership of the collection entities by the data grid
Metadata Attributes
• Associate state information with each name space
  • User name
    • Address, institution
    • Group membership
    • Type - (administrator, curator, owner, public)
  • Logical file name
    • System attributes
      • Location, size, owner, checksum, container, …
    • User-defined attributes
      • Descriptive information
  • Logical resource name
    • Type of system
    • Quotas
Federation Between Data Grids
[Figure: two federated data grids, one holding Data Collection A and one holding Data Collection B, reached through Data Access Methods (Web Browser, DSpace, OAI-PMH)]
Each data grid maintains its own:
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical rule name space
• Logical micro-service name
• Logical persistent state
Aggregation of Identifiers
• Users
  • {Single user, group, federation}
• Resources
  • {Single storage system, cached system, cluster}
• Files
  • {Single file, container, directory}
• Metadata
  • {Single attribute, hierarchical table, collection}
• Management policies
  • {Single capability, set of capabilities, nested rules}
Demonstration of Rules
• Rule specified in four parts
  • Single line, parts separated by the symbol “ | ”
      Name | condition | function-calls | recovery-calls
  • Name
  • Conditions
  • Function calls
  • Recovery calls
• Support multiple functions, separated by symbol “##”
      acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit | msiRollback##msiRollback##nop
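As a minimal sketch of the four-part layout, the following Python splits one rule line into its parts. The `Rule` class and the `parse_rule` and `split_calls` helpers are hypothetical names for illustration; this is not the iRODS parser, and it assumes the condition text itself contains no “|” character.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rule:
    name: str
    condition: str          # empty string means "no condition"
    calls: List[str]        # function calls, in order
    recoveries: List[str]   # recovery calls, in order

def split_calls(field: str) -> List[str]:
    """Split a '##'-separated field; an empty field yields an empty list."""
    field = field.strip()
    return [part.strip() for part in field.split("##")] if field else []

def parse_rule(line: str) -> Rule:
    """Parse 'Name | condition | function-calls | recovery-calls'."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}")
    name, condition, calls, recoveries = parts
    return Rule(name, condition, split_calls(calls), split_calls(recoveries))

rule = parse_rule(
    "acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit"
    " | msiRollback##msiRollback##nop"
)
```

Here `rule.condition` is empty, and `rule.calls` and `rule.recoveries` each hold three entries, one recovery per call.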
Three Classes of Rules
• Internal rules
  • Used within iRODS for standard data manipulation services
• Administrator rules
  • Set by data grid administrator to enforce policies on shared collection
• User-defined rules
  • Support server-driven workflows
Rule-based Data Management
• Associate rules with combinations of name spaces
  • Rule set for a particular collection
  • Rule set for a particular user group
  • Rule set for a particular user group when accessing a particular collection
  • Rule set for a particular storage system
  • Rule set for a particular micro-service
  • Generic rules based on SRB operations
Administrative Rules
• Currently 15 administrative rules
  • Administrative
  • Storage selection
  • Data pre-processing
  • Data post-processing
  • Data deletion
  • Parallel I/O
Administration Creation Rules
acCreateUser | | msiCreateUser##acCreateDefaultCollections##msiCommit | msiRollback##msiRollback##nop
acVacuum(*arg1) | | delayExec(msiVacuum,*arg1) | nop
acCreateDefaultCollections | | acCreateUserZoneCollections | nop
acCreateUserZoneCollections | | acCreateCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acCreateCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop
acCreateCollByAdmin(*parColl,*childColl) | | msiCreateCollByAdmin(*parColl,*childColl) | nop
Administration Deletion Rules
acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit | msiRollback##msiRollback##nop
acDeleteDefaultCollections | | acDeleteUserZoneCollections | nop
acDeleteUserZoneCollections | | acDeleteCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acDeleteCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop
acDeleteCollByAdmin(*parColl,*childColl) | | msiDeleteCollByAdmin(*parColl,*childColl) | nop
Data Manipulation Rules
Rule for pre-processing on storage use
acSetRescSchemeForCreate | | msiSetDefaultResc(demoResc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop
Rule for pre-processing on data reads
acPreprocForDataObjOpen | | msiSortDataObj(random) | nop
Rule for post processing data writes
acPostProcForPut | | nop | nop
acPostProcForCopy | | nop | nop
Rule for setting number of threads for parallel I/O
acSetNumThreads | | msiSetNumThreads(default,default,default) | nop
Rule for data deletion policy setting
acDataDeletePolicy | | nop | nop
iRODS Demonstration
• Demonstrate generic put command
  • ilsresc
  • ils -l nvo
  • iput -R demoResc ../src/icd.c nvo
  • ils -l nvo
• Revise put command to automatically create a replica
  • cp core.irb.1 ../../../server/config/reConfigs/core.irb
  • ils -l nvo
  • iput -R demoResc ../src/ipwd.c nvo
  • ils -l nvo
• Illustrate execution of a user-defined rule
  • icd
  • iput carl.ged foo1
  • irule -vF ruleInp3
iRODS Demonstration
# iRODS Rule Base - core.irb
# Each rule consists of four parts separated by |
# The four parts are: name, conditions, function calls, and recovery.
# The calls and recoveries can be multiple ones, separated by ##.
# For each rule, the number of recovery calls should match the calls;
# for example, if the 2nd call fails, the 2nd recovery call is made.
#
acPreprocForDataObjOpen | | msiSortDataObj(random) | nop
acSetRescSchemeForCreate | | msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop
acDataDeletePolicy | | nop | nop
acPostProcForPut | | nop | nop
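The pairing of calls and recovery calls described in the rule-base comments can be sketched as follows. This is a simplified illustration, not the iRODS rule engine: `run_rule` and the stand-in `actions` table are hypothetical, and the sketch makes only the recovery call that matches the failed call, which is all the comments state. Whether recovery is also made for earlier, already completed calls is not covered here.

```python
from typing import Callable, Dict, List

def run_rule(calls: List[str], recoveries: List[str],
             actions: Dict[str, Callable[[], bool]], log: List[str]) -> bool:
    """Run calls in order; if the n-th call fails, make the n-th recovery call and stop."""
    if len(calls) != len(recoveries):
        raise ValueError("the number of recovery calls should match the calls")
    for call, recovery in zip(calls, recoveries):
        log.append(call)
        if not actions[call]():
            log.append(recovery)
            actions[recovery]()
            return False
    return True

# Stand-in actions that only report success or failure; real micro-services
# are not modelled here.
actions: Dict[str, Callable[[], bool]] = {
    "acDeleteDefaultCollections": lambda: True,
    "msiDeleteUser": lambda: False,   # simulate a failure in the 2nd call
    "msiCommit": lambda: True,
    "msiRollback": lambda: True,
    "nop": lambda: True,
}
log: List[str] = []
ok = run_rule(
    ["acDeleteDefaultCollections", "msiDeleteUser", "msiCommit"],
    ["msiRollback", "msiRollback", "nop"],
    actions, log,
)
```

With the simulated failure, `msiCommit` is never reached and the log ends with the second recovery call, `msiRollback`.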
iRODS Demonstration
# iRODS Rule Base
# Each rule consists of four parts separated by |
# The four parts are: name, conditions, function calls, and recovery.
# The calls and recoveries can be multiple ones, separated by ##.
# For each rule, the number of recovery calls should match the calls;
# for example, if the 2nd call fails, the 2nd recovery call is made.
#
acPreprocForDataObjOpen | | msiSortDataObj(random) | nop
acSetRescSchemeForCreate | | msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop
acDataDeletePolicy | | nop | nop
acPostProcForPut | $objPath like /tempZone/home/rods/nvo/* | msiSysReplDataObj(nvoReplResc) | nop
acPostProcForPut | | nop | nop
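The rule base above lists `acPostProcForPut` twice, once with a condition on `$objPath` and once without. A rough Python sketch of how such a pair could be resolved is shown below. It rests on two assumptions that the slides do not spell out: rules are tried in rule-base order with the first match winning, and `like` behaves as a shell-style wildcard match. `RULE_BASE` and `select_calls` are hypothetical names.

```python
from fnmatch import fnmatchcase
from typing import List, Optional, Tuple

# (rule name, path pattern or "" for an unconditional rule, function calls),
# listed in rule-base order.
RULE_BASE: List[Tuple[str, str, str]] = [
    ("acPostProcForPut", "/tempZone/home/rods/nvo/*", "msiSysReplDataObj(nvoReplResc)"),
    ("acPostProcForPut", "", "nop"),
]

def select_calls(name: str, obj_path: str) -> Optional[str]:
    """Return the calls of the first rule with this name whose condition holds."""
    for rule_name, pattern, calls in RULE_BASE:
        if rule_name != name:
            continue
        if pattern == "" or fnmatchcase(obj_path, pattern):
            return calls
    return None

in_nvo = select_calls("acPostProcForPut", "/tempZone/home/rods/nvo/ipwd.c")
elsewhere = select_calls("acPostProcForPut", "/tempZone/home/rods/foo1")
```

Under these assumptions a put into the `nvo` collection selects the replication call, while a put anywhere else falls through to the unconditional `nop` rule.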
iRODS Demonstration
# This is an example of an input for the irule command.
# This first input line is the rule body
# The second input line is the input parameter in the format of label=value.
# Multiple inputs can be specified using the '%' character as the separator.
# The third input line is the output description. For multiple outputs use '%'
myTestRule | | msiDataObjOpen(*A,*S_FD)##msiDataObjCreate(*B,null,*D1_FD)##msiDataObjRead(*S_FD,100,*R1_BUF)##msiDataObjWrite(*D1_FD,*R1_BUF,*W1_LEN)##msiDataObjClose(*D1_FD,*junk2)##msiDataObjCreate(*C,null,*D2_FD)##msiDataObjRead(*S_FD,50000,*R2_BUF)##msiDataObjWrite(*D2_FD,*R2_BUF,*W2_LEN)##msiDataObjClose(*D2_FD,*junk3)##msiDataObjClose(*S_FD,*junk4)
*A=/tempZone/home/rods/foo1%*B=/tempZone/home/rods/foo2%*C=/tempZone/home/rods/foo3
*R1_BUF%*W2_LEN%*A
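The three-line layout of this input file (rule body, `label=value` inputs separated by `%`, then outputs separated by `%`) can be read with a short Python helper. This is a sketch of the format as described in the comments, not the `irule` implementation; `parse_irule_input` is a hypothetical name and the rule body in the example is shortened for brevity.

```python
from typing import Dict, List, Tuple

def parse_irule_input(text: str) -> Tuple[str, Dict[str, str], List[str]]:
    """Return (rule body, input parameters, output names) from an input file."""
    # Skip blank lines and '#' comments; three content lines should remain.
    lines = [ln.strip() for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if len(lines) != 3:
        raise ValueError(f"expected 3 content lines, got {len(lines)}")
    body, inputs_line, outputs_line = lines
    inputs: Dict[str, str] = {}
    for item in inputs_line.split("%"):
        label, _, value = item.partition("=")
        inputs[label] = value
    outputs = outputs_line.split("%")
    return body, inputs, outputs

# Shortened rule body for illustration; the slide shows the full sequence of calls.
example = """# comment lines are skipped
myTestRule | | msiDataObjOpen(*A,*S_FD)##msiDataObjClose(*S_FD,*junk4)
*A=/tempZone/home/rods/foo1%*B=/tempZone/home/rods/foo2%*C=/tempZone/home/rods/foo3
*R1_BUF%*W2_LEN%*A
"""
body, inputs, outputs = parse_irule_input(example)
```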
iRODS Demonstration
• Add and query metadata
  • imeta add -d foo1 speed 100 "mph"
  • imeta add -d foo1 length 200 "ft"
  • imeta add -d foo2 speed 300 "mph"
  • imeta add -d foo3 length 400 "ft"
  • imeta ls -d foo1
  • imeta qu -d speed = 100
  • imeta qu -d speed ">=" 100
  • imeta qu -d length ">=" 100
• Copy Metadata
  • imeta ls -d foo1
  • imeta ls -d foo3
  • imeta cp -d -d foo1 foo3
  • imeta ls -d foo3
iRODS Demonstration
• Copy metadata attributes on a file to a collection
  • imeta ls -C /tempZone/home/rods
  • imeta cp -d -C foo1 /tempZone/home/rods
  • imeta ls -C /tempZone/home/rods
Preservation Environments
• Working group task
  • Define the sets of
    • Assertions --> set of persistent state
    • Management policies --> set of rules
    • Capabilities --> set of micro-services
  • Solicit groups willing to contribute to development of rule-based technology
    • CASPAR
    • PLANETS
    • NARA
    • UK e-Science data grid
    • IN2P3
    • ARROW
    • Fedora preservation working group
    • DSpace
Preservation Interoperability
• Preserve rules as property of each record
• Register versions of micro-services used to manipulate each record
• Register versions of persistent state information associated with each record
• When migrating a record to a new preservation environment, migrate the rules, micro-services, and persistent state information
Preservation Evolution
• Can define new
  • Rules
  • Micro-services
  • Persistent state information
• Can apply new rules in parallel with old rules, and take the most restrictive rule.
  • Means preservation management policies, capabilities, and assertions can evolve over time.
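One way to read "apply new rules in parallel with old rules, and take the most restrictive rule" is sketched below in Python. The interpretation is an assumption made for illustration: each rule yields a minimum retention period in years, and "most restrictive" is taken to mean the longest one. The rule functions and `effective_retention` are hypothetical.

```python
from typing import Callable, Dict, List

RetentionRule = Callable[[Dict[str, str]], int]  # minimum retention in years

def old_retention_rule(record: Dict[str, str]) -> int:
    # Hypothetical original policy: keep everything for 10 years.
    return 10

def new_retention_rule(record: Dict[str, str]) -> int:
    # Hypothetical revised policy: longer for one record type, shorter otherwise.
    return 25 if record.get("type") == "permanent-candidate" else 7

def effective_retention(record: Dict[str, str], rules: List[RetentionRule]) -> int:
    """Apply every rule and keep the most restrictive outcome (longest retention)."""
    return max(rule(record) for rule in rules)

rules = [old_retention_rule, new_retention_rule]
long_term = effective_retention({"type": "permanent-candidate"}, rules)
ordinary = effective_retention({"type": "email"}, rules)
```

Running both generations of rules side by side means a newly introduced policy can tighten handling of a record but cannot silently loosen what the earlier policy guaranteed.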
Theory of Digital Preservation
Definition of the persistent name spaces
Definition of the operations that are performed upon the persistent name spaces
Characterization of the changes to the persistent state information associated with each persistent name space that occur for each operation
Characterization of the transformations that are made to the records for each operation
Demonstration that the set of operations is complete, enabling the decomposition of every preservation process onto the operation set.
Demonstration that the preservation management policies are complete, enabling the validation of all preservation assessment criteria.
Demonstration that the persistent state information is complete, enabling the validation of assessment criteria.
The assertion is then: if the operations are reversible, then a future preservation environment can recreate a record in its original form, maintain authenticity and integrity, support access, and display the record.
A corollary is that such a system would allow records to be migrated between independent implementations of preservation environments, while maintaining authenticity and integrity.
More Information
SRB: http://www.sdsc.edu/srb
iRODS: http://irods.sdsc.edu/
Full Copyright Notice
Copyright (C) Open Grid Forum (applicable years). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.
The limited permissions granted above are perpetual and will not be revoked by the OGF or its successors or assignees.