© 2006 Open Grid Forum. Preservation Environment Research Group: Rule-based preservation



OGF IPR Policies Apply

• “I acknowledge that participation in this meeting is subject to the OGF Intellectual Property Policy.”

• Intellectual Property Notices Note Well: All statements related to the activities of the OGF and addressed to the OGF are subject to all provisions of Appendix B of GFD-C.1, which grants to the OGF and its participants certain licenses and rights in such statements. Such statements include verbal statements in OGF meetings, as well as written and electronic communications made at any time or place, which are addressed to:
    • the OGF plenary session,
    • any OGF working group or portion thereof,
    • the OGF Board of Directors, the GFSG, or any member thereof on behalf of the OGF,
    • the ADCOM, or any member thereof on behalf of the ADCOM,
    • any OGF mailing list, including any group list, or any other list functioning under OGF auspices,
    • the OGF Editor or the document authoring and review process

• Statements made outside of an OGF meeting, mailing list or other function, that are clearly not intended to be input to an OGF activity, group or function, are not subject to these provisions.

• Excerpt from Appendix B of GFD-C.1: ”Where the OGF knows of rights, or claimed rights, the OGF secretariat shall attempt to obtain from the claimant of such rights, a written assurance that upon approval by the GFSG of the relevant OGF document(s), any party will be able to obtain the right to implement, use and distribute the technology or works when implementing, using or distributing technology based upon the specific specification(s) under openly specified, reasonable, non-discriminatory terms. The working group or research group proposing the use of the technology with respect to which the proprietary rights are claimed may assist the OGF secretariat in this effort. The results of this procedure shall not affect advancement of document, except that the GFSG may defer approval where a delay may facilitate the obtaining of such assurances. The results will, however, be recorded by the OGF Secretariat, and made available. The GFSG may also direct that a summary of the results be included in any GFD published containing the specification.”

• OGF Intellectual Property Policies are adapted from the IETF Intellectual Property Policies that support the Internet Standards Process.


OGF20 Preservation Environments Research Group

• Organizers: Reagan Moore ([email protected]), Bruce Barkstrom ([email protected])

• Goals:
    • Present archives based on data grid technology
        • INCIPIT virtual vellum
    • Analyze capabilities required by a preservation environment
        • Define rule-based preservation environment - iRODS
        • NARA Electronic Records Archive capability requirements
        • RLG/NARA assessment criteria for a Trusted Digital Repository
        • Barkstrom GGF paper - based on NASA Langley preservation model
    • Analyze capabilities that can be based on grid technology
        • iRODS rule-oriented data system

• Participants:
    • 19 contributors to data grid federation for GIN
    • MIT - PLEDGE project on preservation policies
    • SDSC - NARA research prototype persistent archive
    • U Md - Producer Archive Workflow Network
    • EU CASPAR, PLANETS; UK Digital Curation Centre


Virtual Vellum

• Preserve shared collection of medieval manuscripts
• Preserve provenance of manuscripts (authenticity)
• Preserve display services (access)
• Preserve arrangement (respect des fonds)
• Preserve bits (integrity)

• http://www.dcs.shef.ac.uk/~mikem/virtualvellum
• Based on SRB data grid and Virtual Vellum services
• P F Ainsworth <[email protected]>


Preservation Environment Requirements

• Bruce Barkstrom paper
    • Described capabilities needed in a preservation environment

• ERA capabilities list
    • http://www.crl.edu/content.asp?l1=13&l2=58&l3=160

• RLG/NARA trusted digital repository assessment criteria
    • http://www.dlib.org.ar/dlib/july06/ross/07ross.html

• Can we express these capabilities and assessment criteria as rules applied by the data grid?


ERA Capabilities

• List of 854 required capabilities:
    • Management of disposition agreements describing record retention and disposal actions
    • Accession, the formal acceptance of records into the data management system
    • Arrangement, the organization of the records to preserve a required structure (implemented as a collection/sub-collection hierarchy)
    • Description, the management of descriptive metadata as well as text indexing
    • Preservation, the generation of Archival Information Packages
    • Access, the generation of Dissemination Information Packages
    • Subscription, the specification of services that a user picks for execution
    • Notification, the delivery of notices on service execution results
    • Queuing of large scale tasks through interaction with workflow systems
    • System performance and failure reports. Of particular interest is the identification of all failures within the data management system and the recovery procedures that were invoked.
    • Transformative migration, the ability to convert specified data formats to new standards. In this case, each new encoding format is managed as a version of the original record.
    • Display transformation, the ability to reformat a file for presentation.
    • Automated client specification, the ability to pick the appropriate client for each user.


RLG/NARA TDR Assessment Criteria

• The assessment criteria can be mapped to management policies.
• The management policies can be mapped to a set of rules whose execution can be automated.
• The rules require definition of input parameters that define the assertion being implemented.
• The execution of the rules generates state information that can be evaluated to verify the assertion result.
• The types of rules that are needed include:
    • Specification of assertions (setting rule parameters - flags and descriptive metadata)
    • Deferred consistency constraints that may be applied at any time
    • Periodic rules that execute defined procedures
    • Atomic rules applied on each operation (access controls, audit trails)
• The rules determine the metadata attributes that need to be managed
• Set of 174 rules


Digital Preservation

• Preservation is communication with the future
    • How do we migrate records onto new technology (information syntax, encoding format, storage infrastructure, access protocols)?
    • SRB - Storage Resource Broker data grid provides the interoperability mechanisms needed to manage multiple versions of technology

• Preservation manages communication from the past
    • What information do we need from the past to make assertions about preservation assessment criteria (authenticity, integrity, chain of custody)?
    • iRODS - integrated Rule-Oriented Data System


Socialization of Data Management

[Figure: layers of data management, each described by what is conserved, how it is controlled, and how it is operated on]

Data Management Environment: Conserved Properties | Control Mechanisms | Remote Operations
Management Functions: Assessment Criteria | Management Policies | Capabilities
    (Data grid – Management virtualization)
Data Management Infrastructure: Persistent State | Rules | Micro-services
    (Data grid – Data and trust virtualization)
Physical Infrastructure: Database | Rule Engine | Storage System

iRODS - integrated Rule-Oriented Data System


Preservation Management Policies

• Authenticity
    • Validate assertions made at time of data ingestion
    • Validate existence of the descriptive (provenance) metadata
    • Validate retention policy is consistent with submission agreement

• Integrity
    • Maintain information about the management of the data
    • Assertions made by the archivist
        • Access controls, audit trails, checksums, replication, synchronization, federation

• Infrastructure independence
    • Manage properties of records independently of choice of storage system

• Scalability
    • Manage large collections (billions of records, petabytes of data, thousands of attributes)
    • Aggregations across name spaces


iRODS

• Separate definition of management policies (rules) from definition of remote operations (micro-services)

• Control execution of all micro-services through application of rules

• Manage persistent state information for the results

• Query the persistent state information to validate assertions on preservation properties
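
The four points above can be sketched in a few lines of Python. This is only an illustration of the separation of concerns, not iRODS code: the class `MiniGrid`, the rule name `onIngest`, and the `checksum` micro-service are all hypothetical names chosen for this example, and the persistent state is held in memory rather than in a database.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MiniGrid:
    """Toy model: rules name the micro-services to run; results become state."""
    # Remote operations (micro-services), registered by name.
    micro_services: Dict[str, Callable[[dict], object]] = field(default_factory=dict)
    # Management policies (rules): rule name -> ordered micro-service names.
    rules: Dict[str, List[str]] = field(default_factory=dict)
    # Persistent state information (kept in memory in this sketch).
    state: List[dict] = field(default_factory=list)

    def apply_rule(self, rule_name: str, context: dict) -> None:
        # Micro-services are never called directly, only through a rule.
        for ms_name in self.rules[rule_name]:
            result = self.micro_services[ms_name](context)
            self.state.append({
                "rule": rule_name,
                "micro_service": ms_name,
                "data_object": context.get("data_object"),
                "result": result,
            })

    def query_state(self, **criteria) -> List[dict]:
        # Return every state record whose fields equal the given criteria.
        return [row for row in self.state
                if all(row.get(k) == v for k, v in criteria.items())]


def checksum(context: dict) -> str:
    # Hypothetical micro-service: compute a SHA-256 digest of the bytes.
    return hashlib.sha256(context["data"]).hexdigest()


grid = MiniGrid()
grid.micro_services["checksum"] = checksum
grid.rules["onIngest"] = ["checksum"]
grid.apply_rule("onIngest", {"data_object": "foo1", "data": b"hello"})

# Validate an integrity assertion by querying the recorded state.
rows = grid.query_state(data_object="foo1", micro_service="checksum")
```

Because the policy lives in `rules` and the operation lives in `micro_services`, either can be replaced without touching the other, which is the property the slide emphasizes.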


iRODS - integrated Rule-Oriented Data System

[Figure: iRODS architecture diagram. Components shown: Client Interface; Admin Interface; Resources; Rule Invoker; Rule Engine; Rule Base; Current State; Confs; Metadata Persistent Repository; Service Manager; Micro Service Modules (Resource-based Services); Micro Service Modules (Metadata-based Services); Rule Modifier Module; Config Modifier Module; Metadata Modifier Module; Consistency Check Modules]


Managing Preservation Policies

• Require at least six name spaces for managing identity
    • Logical storage resource name space
    • Logical user name space
    • Logical file name space
    • Logical rule name space
    • Logical micro-service name space
    • Logical persistent state name space

• Require ability to federate name spaces
    • Cross-register identity of object from each of the name spaces

• Require multiple levels of aggregation for each name space
    • Typically three levels of aggregation

• Trust virtualization
    • Ownership of the collection entities by the data grid


Metadata Attributes

• Associate state information with each name space

• User name
    • Address, institution
    • Group membership
    • Type - (administrator, curator, owner, public)

• Logical file name
    • System attributes
        • Location, size, owner, checksum, container, …
    • User-defined attributes
        • Descriptive information

• Logical resource name
    • Type of system
    • Quotas


Federation Between Data Grids

[Figure: two federated data grids, one managing Data Collection A and the other Data Collection B, both reached through common data access methods (Web Browser, DSpace, OAI-PMH). Each data grid maintains its own:]

• Logical resource name space
• Logical user name space
• Logical file name space
• Logical rule name space
• Logical micro-service name space
• Logical persistent state name space


Aggregation of Identifiers

• Users
    • {Single user, group, federation}

• Resources
    • {Single storage system, cached system, cluster}

• Files
    • {Single file, container, directory}

• Metadata
    • {Single attribute, hierarchical table, collection}

• Management policies
    • {Single capability, set of capabilities, nested rules}


Demonstration of Rules

• Rule specified in four parts
    • Single line, parts separated by the symbol “ | ”
      Name | condition | function-calls | recovery-calls
    • Name
    • Conditions
    • Function calls
    • Recovery calls

• Support multiple functions, separated by symbol “##”
  acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit | msiRollback##msiRollback##nop
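
As a minimal sketch of the syntax just described, the following Python splits one rule line into its four parts and then splits the call lists on "##". The `Rule` class and `parse_rule` function are names invented for this example, and the parser assumes that "|" never appears inside a condition or an argument.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    name: str
    condition: str          # empty string means "always applies"
    calls: List[str]        # function calls, in execution order
    recoveries: List[str]   # recovery calls, one per function call


def split_calls(text: str) -> List[str]:
    # Multiple calls are separated by the symbol "##".
    return [c.strip() for c in text.split("##") if c.strip()]


def parse_rule(line: str) -> Rule:
    # A rule is a single line with four parts separated by "|".
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts separated by '|', got {len(parts)}")
    name, condition, calls, recoveries = parts
    return Rule(name, condition, split_calls(calls), split_calls(recoveries))


rule = parse_rule(
    "acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit"
    " | msiRollback##msiRollback##nop"
)
```

For the example rule, the condition is empty, so the rule applies unconditionally, and each of the three function calls has a matching recovery call in the same position.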


Three Classes of Rules

• Internal rules
    • Used within iRODS for standard data manipulation services

• Administrator rules
    • Set by data grid administrator to enforce policies on shared collection

• User-defined rules
    • Support server-driven workflows


Rule-based Data Management

• Associate rules with combinations of name spaces
    • Rule set for a particular collection
    • Rule set for a particular user group
    • Rule set for a particular user group when accessing a particular collection
    • Rule set for a particular storage system
    • Rule set for a particular micro-service
    • Generic rules based on SRB operations
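
One way to picture rule sets attached to combinations of name spaces is a lookup that prefers the most specific match. The sketch below is an assumption about how such a selection could work, not a description of iRODS behaviour: the function `select_rule_set`, the scope keys, and the rule-set names are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Each entry pairs a scope (name space -> required value) with a rule-set name.
# An empty scope matches every request and plays the role of the generic rules.
RuleSets = List[Tuple[Dict[str, str], str]]


def select_rule_set(rule_sets: RuleSets, **request: str) -> Optional[str]:
    """Return the rule set whose scope matches the request most specifically."""
    best_name, best_score = None, -1
    for scope, name in rule_sets:
        matches = all(request.get(key) == value for key, value in scope.items())
        # Assumption: a scope that constrains more name spaces wins.
        if matches and len(scope) > best_score:
            best_name, best_score = name, len(scope)
    return best_name


RULE_SETS: RuleSets = [
    ({}, "generic"),
    ({"collection": "nvo"}, "nvo-rules"),
    ({"group": "curators"}, "curator-rules"),
    ({"collection": "nvo", "group": "curators"}, "nvo-curator-rules"),
]

chosen = select_rule_set(RULE_SETS, collection="nvo", group="curators")
```

Here a curator accessing the `nvo` collection gets the combined rule set, while any request that matches no specific scope falls back to the generic rules.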


Administrative Rules

• Currently 15 administrative rules
    • Administrative
    • Storage selection
    • Data pre-processing
    • Data post-processing
    • Data deletion
    • Parallel I/O


Administration Creation Rules

acCreateUser | | msiCreateUser##acCreateDefaultCollections##msiCommit | msiRollback##msiRollback##nop

acVacuum(*arg1) | | delayExec(msiVacuum,*arg1) | nop

acCreateDefaultCollections | | acCreateUserZoneCollections | nop

acCreateUserZoneCollections | | acCreateCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acCreateCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop

acCreateCollByAdmin(*parColl,*childColl) | | msiCreateCollByAdmin(*parColl,*childColl) | nop


Administration Deletion Rules

acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit | msiRollback##msiRollback##nop

acDeleteDefaultCollections | | acDeleteUserZoneCollections | nop

acDeleteUserZoneCollections | | acDeleteCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acDeleteCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop

acDeleteCollByAdmin(*parColl,*childColl) | | msiDeleteCollByAdmin(*parColl,*childColl) | nop
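
The deletion rules above are nested: a rule's function calls may name other rules, which bottom out in micro-services. The Python sketch below flattens that nesting for `acDeleteUser`. It is a simplification under stated assumptions: arguments are dropped, the `RULES` table is transcribed by hand from the slide, and any name without an entry in the table is treated as a micro-service.

```python
from typing import Dict, List

# Rule name -> function calls, transcribed from the rules above (arguments omitted).
RULES: Dict[str, List[str]] = {
    "acDeleteUser": ["acDeleteDefaultCollections", "msiDeleteUser", "msiCommit"],
    "acDeleteDefaultCollections": ["acDeleteUserZoneCollections"],
    "acDeleteUserZoneCollections": ["acDeleteCollByAdmin", "acDeleteCollByAdmin"],
    "acDeleteCollByAdmin": ["msiDeleteCollByAdmin"],
}


def expand(name: str, rules: Dict[str, List[str]], depth: int = 0) -> List[str]:
    """Expand a rule into the ordered list of micro-services it ends up invoking."""
    if depth > 20:
        # Guard against rules that (directly or indirectly) call themselves.
        raise RecursionError(f"rule nesting too deep at {name}")
    if name not in rules:
        return [name]  # not a known rule, so treat it as a micro-service
    flat: List[str] = []
    for call in rules[name]:
        flat.extend(expand(call, rules, depth + 1))
    return flat


flat_calls = expand("acDeleteUser", RULES)
```

The expansion shows that deleting a user removes two collections (home and trash) before the user record itself is deleted and the transaction is committed.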


Data Manipulation Rules

Rule for pre-processing on storage use
acSetRescSchemeForCreate | | msiSetDefaultResc(demoResc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop

Rule for pre-processing on data reads
acPreprocForDataObjOpen | | msiSortDataObj(random) | nop

Rule for post-processing data writes
acPostProcForPut | | nop | nop
acPostProcForCopy | | nop | nop

Rule for setting number of threads for parallel I/O
acSetNumThreads | | msiSetNumThreads(default,default,default) | nop

Rule for data deletion policy setting
acDataDeletePolicy | | nop | nop


iRODS Demonstration

• Demonstrate generic put command
    • ilsresc
    • ils -l nvo
    • iput -R demoResc ../src/icd.c nvo
    • ils -l nvo

• Revise put command to automatically create a replica
    • cp core.irb.1 ../../../server/config/reConfigs/core.irb
    • ils -l nvo
    • iput -R demoResc ../src/ipwd.c nvo
    • ils -l nvo

• Illustrate execution of a user-defined rule
    • icd
    • iput carl.ged foo1
    • irule -vF ruleInp3


iRODS Demonstration

# iRODS Rule Base - core.irb
# Each rule consists of four parts separated by |
# The four parts are: name, conditions, function calls, and recovery.
# The calls and recoveries can be multiple ones, separated by ##.
# For each rule, the number of recovery calls should match the calls;
# for example, if the 2nd call fails, the 2nd recovery call is made.
#
acPreprocForDataObjOpen | | msiSortDataObj(random) | nop
acSetRescSchemeForCreate | | msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop
acDataDeletePolicy | | nop | nop
acPostProcForPut | | nop | nop
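
The pairing of calls and recoveries described in the comments of the rule base can be sketched as follows. This Python models only what the comments state: if the n-th call fails, the n-th recovery call is made. Whether the real rule engine also runs earlier recoveries is not stated in the slides, so this sketch stops after the single matching recovery. The function `run_rule` and the stand-in callables are hypothetical.

```python
from typing import Callable, Dict, List


def run_rule(calls: List[str], recoveries: List[str],
             registry: Dict[str, Callable[[], None]], log: List[str]) -> bool:
    """Run calls in order; on the first failure run the recovery at the same index."""
    if len(calls) != len(recoveries):
        raise ValueError("the number of recovery calls should match the calls")
    for index, name in enumerate(calls):
        try:
            registry[name]()
            log.append(f"ok:{name}")
        except Exception:
            log.append(f"failed:{name}")
            registry[recoveries[index]]()
            log.append(f"recovered:{recoveries[index]}")
            return False  # the rule did not complete
    return True


def simulated_failure() -> None:
    raise RuntimeError("simulated micro-service failure")


# Stand-ins for the real rule and micro-services; only msiDeleteUser fails here.
REGISTRY: Dict[str, Callable[[], None]] = {
    "acDeleteDefaultCollections": lambda: None,
    "msiDeleteUser": simulated_failure,
    "msiCommit": lambda: None,
    "msiRollback": lambda: None,
    "nop": lambda: None,
}

log: List[str] = []
completed = run_rule(
    ["acDeleteDefaultCollections", "msiDeleteUser", "msiCommit"],
    ["msiRollback", "msiRollback", "nop"],
    REGISTRY,
    log,
)
```

In this run the second call fails, so the second recovery call (`msiRollback`) is made and `msiCommit` is never reached.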



iRODS Demonstration

# iRODS Rule Base
# Each rule consists of four parts separated by |
# The four parts are: name, conditions, function calls, and recovery.
# The calls and recoveries can be multiple ones, separated by ##.
# For each rule, the number of recovery calls should match the calls;
# for example, if the 2nd call fails, the 2nd recovery call is made.
#
acPreprocForDataObjOpen | | msiSortDataObj(random) | nop
acSetRescSchemeForCreate | | msiSetDefaultResc(demo2Resc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop
acDataDeletePolicy | | nop | nop
acPostProcForPut | $objPath like /tempZone/home/rods/nvo/* | msiSysReplDataObj(nvoReplResc) | nop
acPostProcForPut | | nop | nop
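
The revised rule base has two `acPostProcForPut` rules, one guarded by a condition on `$objPath`. A minimal Python sketch of how such a pair could be resolved is given below. It rests on two assumptions that the slides do not spell out: rules are tried in file order and the first one whose condition holds is used, and `like` behaves as a shell-style wildcard match. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Dict, List, Optional, Tuple

# (rule name, condition, function calls) in the order they appear in the rule base.
RULES: List[Tuple[str, str, str]] = [
    ("acPostProcForPut", "$objPath like /tempZone/home/rods/nvo/*",
     "msiSysReplDataObj(nvoReplResc)"),
    ("acPostProcForPut", "", "nop"),
]


def condition_holds(condition: str, session: Dict[str, str]) -> bool:
    """Evaluate an empty condition or one of the form '$var like pattern'."""
    if not condition.strip():
        return True  # an empty condition always applies
    variable, operator_name, pattern = condition.split(None, 2)
    if operator_name != "like":
        raise ValueError(f"unsupported operator in this sketch: {operator_name}")
    return fnmatchcase(session[variable.lstrip("$")], pattern)


def select_calls(name: str, session: Dict[str, str]) -> Optional[str]:
    # Assumption: the first rule with this name whose condition holds is applied.
    for rule_name, condition, calls in RULES:
        if rule_name == name and condition_holds(condition, session):
            return calls
    return None


inside = select_calls("acPostProcForPut",
                      {"objPath": "/tempZone/home/rods/nvo/ipwd.c"})
outside = select_calls("acPostProcForPut",
                       {"objPath": "/tempZone/home/rods/foo1"})
```

Under these assumptions a file put into the `nvo` collection triggers replication to `nvoReplResc`, while a put anywhere else falls through to the unconditional `nop` rule.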


iRODS Demonstration

# This is an example of an input for the irule command.
# This first input line is the rule body
# The second input line is the input parameter in the format of label=value.
# Multiple inputs can be specified using the '%' character as the separator.
# The third input line is the output description. For multiple outputs use '%'
myTestRule | | msiDataObjOpen(*A,*S_FD)##msiDataObjCreate(*B,null,*D1_FD)##msiDataObjRead(*S_FD,100,*R1_BUF)##msiDataObjWrite(*D1_FD,*R1_BUF,*W1_LEN)##msiDataObjClose(*D1_FD,*junk2)##msiDataObjCreate(*C,null,*D2_FD)##msiDataObjRead(*S_FD,50000,*R2_BUF)##msiDataObjWrite(*D2_FD,*R2_BUF,*W2_LEN)##msiDataObjClose(*D2_FD,*junk3)##msiDataObjClose(*S_FD,*junk4)
*A=/tempZone/home/rods/foo1%*B=/tempZone/home/rods/foo2%*C=/tempZone/home/rods/foo3
*R1_BUF%*W2_LEN%*A
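
The three-line input format described in the comments above can be read with a few lines of Python. This is a sketch of the file layout only, not of the `irule` command itself: `parse_irule_input` is a hypothetical helper, the sample text uses a shortened rule body for readability, and lines starting with "#" are assumed to be comments.

```python
from typing import Dict, List, Tuple


def parse_irule_input(text: str) -> Tuple[str, Dict[str, str], List[str]]:
    """Split an irule-style input into (rule body, inputs, outputs)."""
    lines = [line.strip() for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#")]
    if len(lines) != 3:
        raise ValueError(f"expected 3 non-comment lines, got {len(lines)}")
    body, inputs_line, outputs_line = lines
    # Inputs are label=value pairs separated by '%'.
    inputs = dict(item.split("=", 1) for item in inputs_line.split("%"))
    # Outputs are labels separated by '%'.
    outputs = outputs_line.split("%")
    return body, inputs, outputs


SAMPLE = """\
# Shortened version of the example above, for illustration only.
myTestRule | | msiDataObjOpen(*A,*S_FD)##msiDataObjClose(*S_FD,*junk4)
*A=/tempZone/home/rods/foo1%*B=/tempZone/home/rods/foo2
*R1_BUF%*W2_LEN%*A
"""

body, inputs, outputs = parse_irule_input(SAMPLE)
```

The rule body is kept as an opaque string here; the "##" inside it does not clash with comment detection because only lines that begin with "#" are skipped.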


iRODS Demonstration

• Add and query metadata
    • imeta add -d foo1 speed 100 "mph"
    • imeta add -d foo1 length 200 "ft"
    • imeta add -d foo2 speed 300 "mph"
    • imeta add -d foo3 length 400 "ft"
    • imeta ls -d foo1
    • imeta qu -d speed = 100
    • imeta qu -d speed ">=" 100
    • imeta qu -d length ">=" 100

• Copy Metadata
    • imeta ls -d foo1
    • imeta ls -d foo3
    • imeta cp -d -d foo1 foo3
    • imeta ls -d foo3
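
To make the effect of these commands concrete, the sketch below models the metadata as attribute, value, unit triples in a small in-memory Python class. It is not the `imeta` implementation: `MetaStore` and its methods are hypothetical, and values are compared numerically, which is an assumption (the real command may compare them as strings).

```python
import operator
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (attribute, value, unit)

OPERATORS = {"=": operator.eq, ">=": operator.ge, "<=": operator.le,
             ">": operator.gt, "<": operator.lt}


class MetaStore:
    def __init__(self) -> None:
        self.triples: Dict[str, List[Triple]] = {}

    def add(self, obj: str, attribute: str, value: str, unit: str = "") -> None:
        self.triples.setdefault(obj, []).append((attribute, value, unit))

    def ls(self, obj: str) -> List[Triple]:
        return list(self.triples.get(obj, []))

    def query(self, attribute: str, op: str, value: str) -> List[str]:
        # Assumption: values are numeric and compared as floats.
        compare = OPERATORS[op]
        return sorted(
            obj for obj, triples in self.triples.items()
            if any(a == attribute and compare(float(v), float(value))
                   for a, v, _unit in triples)
        )

    def copy(self, source: str, target: str) -> None:
        # Copy every triple from source that the target does not already have.
        existing = self.triples.setdefault(target, [])
        for triple in self.ls(source):
            if triple not in existing:
                existing.append(triple)


store = MetaStore()
store.add("foo1", "speed", "100", "mph")
store.add("foo1", "length", "200", "ft")
store.add("foo2", "speed", "300", "mph")
store.add("foo3", "length", "400", "ft")

fast = store.query("speed", ">=", "100")
store.copy("foo1", "foo3")
```

After the copy, `foo3` carries its own `length` attribute plus both attributes from `foo1`, which mirrors the before-and-after listing in the demonstration.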


iRODS Demonstration

• Copy metadata attributes on a file to a collection
    • imeta ls -C /tempZone/home/rods
    • imeta cp -d -C foo1 /tempZone/home/rods
    • imeta ls -C /tempZone/home/rods


Preservation Environments

• Working group task
    • Define the sets of
        • Assertions --> set of persistent state
        • Management policies --> set of rules
        • Capabilities --> set of micro-services

• Solicit groups willing to contribute to development of rule-based technology
    • CASPAR
    • PLANETS
    • NARA
    • UK e-Science data grid
    • IN2P3
    • ARROW
    • Fedora preservation working group
    • DSpace


Preservation Interoperability

• Preserve rules as property of each record
• Register versions of micro-services used to manipulate each record
• Register versions of persistent state information associated with each record

• When migrating a record to a new preservation environment, migrate the rules, micro-services, and persistent state information


Preservation Evolution

• Can define new
    • Rules
    • Micro-services
    • Persistent state information

• Can apply new rules in parallel with old rules, and take the most restrictive rule.
    • Means preservation management policies, capabilities, and assertions can evolve over time.
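
One possible reading of "take the most restrictive rule" is sketched below in Python. It interprets restrictiveness per policy parameter rather than per whole rule, which is an assumption, and every parameter name (`min_replicas`, `retention_years`, `checksum_interval_days`) is hypothetical.

```python
from typing import Callable, Dict

Policy = Dict[str, int]


def most_restrictive(old: Policy, new: Policy,
                     stricter: Dict[str, Callable[[int, int], int]]) -> Policy:
    """Combine two policies, keeping the stricter value of each shared parameter."""
    merged: Policy = {}
    for key in sorted(old.keys() | new.keys()):
        if key in old and key in new:
            merged[key] = stricter[key](old[key], new[key])
        else:
            # A parameter defined by only one policy is carried over unchanged.
            merged[key] = old[key] if key in old else new[key]
    return merged


# For each parameter, say which direction counts as "more restrictive".
STRICTER = {
    "min_replicas": max,             # more copies is stricter
    "retention_years": max,          # keeping records longer is stricter
    "checksum_interval_days": min,   # verifying more often is stricter
}

old_policy = {"min_replicas": 2, "retention_years": 10}
new_policy = {"min_replicas": 3, "retention_years": 7,
              "checksum_interval_days": 30}

effective = most_restrictive(old_policy, new_policy, STRICTER)
```

Running old and new policies side by side in this way means a record never receives weaker treatment while the policies evolve.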


Theory of Digital Preservation

• Definition of the persistent name spaces
• Definition of the operations that are performed upon the persistent name spaces
• Characterization of the changes to the persistent state information associated with each persistent name space that occur for each operation
• Characterization of the transformations that are made to the records for each operation
• Demonstration that the set of operations is complete, enabling the decomposition of every preservation process onto the operation set.
• Demonstration that the preservation management policies are complete, enabling the validation of all preservation assessment criteria.
• Demonstration that the persistent state information is complete, enabling the validation of assessment criteria.

The assertion is then: if the operations are reversible, then a future preservation environment can recreate a record in its original form, maintain authenticity and integrity, support access, and display the record.

A corollary is that such a system would allow records to be migrated between independent implementations of preservation environments, while maintaining authenticity and integrity.
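
The reversibility assertion can be illustrated with a toy example in Python. Compression and base64 encoding stand in for preservation operations purely because they have exact inverses in the standard library; the `OPERATIONS` table and both functions are hypothetical, and the recorded list of operation names plays the role of persistent state information.

```python
import base64
import zlib
from typing import Callable, Dict, List, Tuple

# Operation name -> (forward transformation, inverse transformation).
OPERATIONS: Dict[str, Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]] = {
    "compress": (zlib.compress, zlib.decompress),
    "base64": (base64.b64encode, base64.b64decode),
}


def migrate(record: bytes, history: List[str]) -> bytes:
    """Apply each named operation in order."""
    for name in history:
        record = OPERATIONS[name][0](record)
    return record


def recreate(record: bytes, history: List[str]) -> bytes:
    """Undo the recorded operations in reverse order."""
    for name in reversed(history):
        record = OPERATIONS[name][1](record)
    return record


original = b"medieval manuscript page"
history = ["compress", "base64"]   # persistent state: what was done, in order

migrated = migrate(original, history)
restored = recreate(migrated, history)
```

Because every operation has an inverse and the history is preserved with the record, the original bytes can be recovered exactly, which is the condition the assertion depends on.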


More Information

[email protected]

SRB: http://www.sdsc.edu/srb

iRODS: http://irods.sdsc.edu/


Full Copyright Notice

Copyright (C) Open Grid Forum (applicable years). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works.

The limited permissions granted above are perpetual and will not be revoked by the OGF or its successors or assignees.