Information Management and Distributed Data

Reagan W. Moore
Wayne Schroeder
Mike Wan
Arcot Rajasekar
Richard Marciano

{moore, schroede, mwan, sekar, marciano}@sdsc.edu

http://www.sdsc.edu/srb

http://irods.sdsc.edu/

Data Grid Software

• Storage Resource Broker
  • http://www.sdsc.edu/srb

• iRODS - integrated Rule-Oriented Data System
  • http://irods.sdsc.edu
  • Version 0.9 released May 30, 2007
  • Version 1.0 scheduled for fall, 2007
  • Open source - BSD license

Concepts

• Distributed Data Management Concepts
  • Data virtualization
    • Infrastructure independence
  • Trust virtualization
    • Administrative domain independence
  • Federation

• Rule-based Data Management
  • Management virtualization
    • Automating execution of management policies
    • Coupling management policies to assertions about data

Data Virtualization

• Manage properties of each digital entity independently of the remote storage systems
  • Infrastructure independence

• Properties
  • Name spaces
  • Persistent state information (location, size,…)

• Manage standard operations
  • Client actions
  • Operations performed at remote storage systems

Data Virtualization

[Figure: layered diagram with the labels Access Interface, Standard Access Actions, Data Grid, Standard Micro-services, Storage Protocol, Storage System]

Map from the actions requested by the access method to a standard set of micro-services used to interact with the storage system
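The mapping described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `MemoryStorage`, `ACTION_MAP`, `DataGrid`, and the `/vault/...` paths are hypothetical names, not part of SRB or iRODS. The sketch shows a client action ("put", "get") being expanded into a fixed sequence of standard operations, while the logical name and its persistent state (location, size) are tracked independently of the storage system.

```python
from typing import Dict, List, Optional


class MemoryStorage:
    """Hypothetical stand-in for one storage system and its protocol."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def create(self, path: str) -> None:
        self.blobs[path] = b""

    def write(self, path: str, data: bytes) -> None:
        self.blobs[path] += data

    def read(self, path: str) -> bytes:
        return self.blobs[path]


# Assumed mapping: each access action expands to standard operations.
ACTION_MAP: Dict[str, List[str]] = {"put": ["create", "write"], "get": ["read"]}


class DataGrid:
    """Keeps logical names and persistent state apart from the storage system."""

    def __init__(self, storage: MemoryStorage) -> None:
        self.storage = storage
        self.catalog: Dict[str, Dict[str, object]] = {}

    def perform(self, action: str, logical_name: str, data: bytes = b"") -> Optional[bytes]:
        if action == "put":
            # Record persistent state (location, size) under the logical name.
            self.catalog[logical_name] = {
                "location": f"/vault/{len(self.catalog)}",
                "size": len(data),
            }
        location = str(self.catalog[logical_name]["location"])
        result: Optional[bytes] = None
        for operation in ACTION_MAP[action]:
            if operation == "create":
                self.storage.create(location)
            elif operation == "write":
                self.storage.write(location, data)
            elif operation == "read":
                result = self.storage.read(location)
        return result
```

Swapping `MemoryStorage` for another class with the same three methods would leave the client-facing actions and the logical names unchanged, which is the infrastructure independence the slide refers to.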

Federation Between Data Grids

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical rule name space

• Logical micro-service name

• Logical persistent state

Data Collection B

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical rule name space

• Logical micro-service name

• Logical persistent state

Data Collection A

Production Data Grids: Observations

• Data grids manage shared collections that are distributed across multiple storage systems and institutions
• Data grids are responsible for providing recovery mechanisms for all errors that occur in the distributed environment
• The number of observed problems is proportional to the size of the collections
• Need to minimize labor costs by automating:
  • Application of management policies
  • Execution of administrative functions for error recovery
  • Validation of preservation assessment criteria

Observations of Production Data Grids

• Each community implements different management policies
• Need a mechanism to support the socialization of shared collections
  • Community specific preservation objectives
  • Community specific assertions about properties of the shared collection
  • Community specific management policies

iRODS

• What additional levels of virtualization are required to support advanced data management applications?
• Observe that each community imposes different management policies.
  • Different criteria for data retention, disposition, access control, data caching, replication
  • Assertions on collection integrity and authenticity such as required metadata
  • Assertions on data distribution, data transport
• Need the ability to characterize management policies, automate their application, and verify collection properties

Socialization of Data Collections

• Management policies are a mechanism for the "socialization" of a collection. The management policies describe how the collection can be accessed by a broader community, the internal consistency mechanisms that maintain the reputation of the builders of the collection, and the collection consistency properties that the broader community can expect when they access the data.

• Management policies transform from the expectations of the designated community that built the collection to the expectations of the wider world that uses the collection.

• While management policies are unique for each record collection, generic management policies exist that can be tuned to represent the "socialization" of the collection.

Data Management

Data Management Environment    | Conserved Properties | Control Mechanisms  | Remote Operations
Management Functions           | Assessment Criteria  | Management Policies | Capabilities
    Data grid – Management virtualization
Data Management Infrastructure | Persistent State     | Rules               | Micro-services
    Data grid – Data and trust virtualization
Physical Infrastructure        | Database             | Rule Engine         | Storage System

iRODS - integrated Rule-Oriented Data System

Rule-based Data Management

• Map from management policies to rules controlling execution of remote micro-services
• Manage persistent state information for results of each micro-service execution
• Support an additional three logical name spaces
  • Rules
  • Micro-services
  • Persistent state information
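The three additional name spaces can be pictured with a small Python sketch. It is illustrative only: the dictionaries, the `register` decorator, and every `hypo...` name are hypothetical and do not correspond to actual iRODS rules or micro-services. The sketch shows a rule resolving micro-services by logical name and the result of each execution being saved as persistent state.

```python
import zlib
from typing import Callable, Dict, List

# Three logical name spaces, modeled here as plain dictionaries (an assumption).
micro_services: Dict[str, Callable[[str, bytes], str]] = {}
rules: Dict[str, List[str]] = {}                   # rule name -> micro-service names
persistent_state: Dict[str, Dict[str, str]] = {}   # object -> attribute -> value


def register(name: str):
    """Bind a function to a logical micro-service name."""
    def wrap(function: Callable[[str, bytes], str]):
        micro_services[name] = function
        return function
    return wrap


@register("hypoChecksum")
def hypo_checksum(obj_path: str, data: bytes) -> str:
    return format(zlib.crc32(data), "08x")


@register("hypoRecordSize")
def hypo_record_size(obj_path: str, data: bytes) -> str:
    return str(len(data))


# A hypothetical policy expressed as a rule over logical micro-service names.
rules["hypoPostProcForPut"] = ["hypoChecksum", "hypoRecordSize"]


def apply_rule(rule_name: str, obj_path: str, data: bytes) -> None:
    """Run each micro-service of the rule and save its result as state."""
    for service_name in rules[rule_name]:
        result = micro_services[service_name](obj_path, data)
        persistent_state.setdefault(obj_path, {})[service_name] = result
```

Because the rule refers to micro-services only by logical name, an implementation can be replaced without editing the rule.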

iRODS - integrated Rule-Oriented Data System

[Figure: iRODS architecture diagram. Components shown: Client Interface, Admin Interface, Resources, Rule Invoker, Service Manager, Rule Engine, Current State, Rule Base, Confs, Metadata Persistent Repository, Metadata Modifier Module, Config Modifier Module, Rule Modifier Module, Consistency Check Modules, Micro Service Modules (Resource-based Services), Micro Service Modules (Metadata-based Services)]

Management Virtualization

• Standard policies expressed as rules
• Rules control execution of data management and access operations
• Integrity
  • Validation of checksums
  • Synchronization of replicas
  • Data distribution
  • Data retention
  • Access controls
• Authenticity
  • Chain of custody - audit trails
  • Required preservation metadata - templates
  • Generation of AIPs, DIPs
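Two of the integrity operations listed above, validation of checksums and synchronization of replicas, can be sketched in Python. This is a hedged illustration rather than how iRODS implements them: SHA-256 is chosen here only because it is in the standard library, and the resource names used with these functions are hypothetical.

```python
import hashlib
from typing import Dict, List


def checksum(data: bytes) -> str:
    """Checksum used for validation (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def find_bad_replicas(replicas: Dict[str, bytes], expected: str) -> List[str]:
    """Return the resources whose replica no longer matches the stored checksum."""
    return sorted(name for name, data in replicas.items() if checksum(data) != expected)


def synchronize(replicas: Dict[str, bytes], expected: str) -> List[str]:
    """Overwrite corrupted replicas with a valid copy; return what was repaired."""
    good = next((data for data in replicas.values() if checksum(data) == expected), None)
    if good is None:
        raise ValueError("no valid replica left to synchronize from")
    repaired = find_bad_replicas(replicas, expected)
    for name in repaired:
        replicas[name] = good
    return repaired
```

The expected checksum plays the role of persistent state information: it is recorded when the record is stored and consulted on every later validation.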

Example Rules

• Rule composed of four parts:
  • Name | condition | micro-service set | recovery

• Rule to automate replication of data for a specific collection

acPostProcForPut | $objPath like /tempZone/home/rods/nvo/* | msiSysReplDataObj(nvoReplResc,null) | nop

• Rule types
  • Internal, administrative, user-defined
  • Atomic, deferred, periodic
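The four-part rule format can be made concrete with a short Python sketch that splits a rule line into name, condition, micro-service set, and recovery set. It is a reading aid, not the iRODS rule parser. In particular, treating `like` as shell-style wildcard matching through `fnmatchcase` is an assumption, and the `Rule` class and both functions are hypothetical names.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Dict, List


@dataclass
class Rule:
    name: str
    condition: str
    actions: List[str]
    recovery: List[str]


def parse_rule(text: str) -> Rule:
    """Split 'name | condition | micro-service set | recovery' into its parts."""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) != 4:
        raise ValueError("a rule needs exactly four '|'-separated parts")
    name, condition, actions, recovery = parts
    return Rule(
        name,
        condition,
        [item.strip() for item in actions.split("##")],
        [item.strip() for item in recovery.split("##")],
    )


def condition_holds(condition: str, variables: Dict[str, str]) -> bool:
    """Evaluate '<variable> like <pattern>'; an empty condition always holds."""
    if not condition:
        return True
    variable, operator, pattern = condition.split(None, 2)
    if operator != "like":
        raise ValueError("this sketch only understands 'like'")
    # Assumption: 'like' behaves as wildcard matching on the whole value.
    return fnmatchcase(variables[variable], pattern)
```

Applied to the replication rule above, `parse_rule` yields one action and one recovery step, and `condition_holds` accepts only object paths under `/tempZone/home/rods/nvo/`.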

Three Classes of Rules

• Internal rules
  • Used within iRODS for standard data manipulation services

• Administrator rules
  • Set by data grid administrator to enforce policies on shared collection

• User-defined rules
  • Support server-driven workflows

Rule-based Data Management

• Associate rules with combinations of name spaces
  • Rule set for a particular collection
  • Rule set for a particular user group
  • Rule set for a particular user group when accessing a particular collection
  • Rule set for a particular storage system
  • Rule set for a particular micro-service
  • Generic rules based on SRB operations
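One way to picture rule sets attached to combinations of name spaces is a lookup keyed on (user group, collection). The Python sketch below is an assumption-laden illustration: the precedence order (most specific combination first), the group and collection names, and all `hypo...` rule names are hypothetical and are not taken from iRODS.

```python
from typing import Dict, List, Optional, Tuple

# Key is (user group, collection); None means "any" (a convention of this sketch).
Key = Tuple[Optional[str], Optional[str]]

RULE_SETS: Dict[Key, List[str]] = {
    ("curators", "/zone/archive"): ["hypoAuditAccess"],
    (None, "/zone/archive"): ["hypoReplicateOnPut"],
    ("curators", None): ["hypoAllowDelete"],
    (None, None): ["hypoGenericPolicy"],
}


def select_rule_set(group: str, collection: str) -> List[str]:
    """Return the most specific rule set; the precedence order is assumed."""
    candidates: List[Key] = [
        (group, collection),   # user group accessing a particular collection
        (None, collection),    # particular collection
        (group, None),         # particular user group
        (None, None),          # generic rules
    ]
    for key in candidates:
        if key in RULE_SETS:
            return RULE_SETS[key]
    return []
```

The same pattern extends to storage systems or micro-services by widening the key.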

Administrative Rules

• Currently 15 administrative rules
  • Administrative
  • Storage selection
  • Data pre-processing
  • Data post-processing
  • Data deletion
  • Parallel I/O

Administration Creation Rules

acCreateUser | | msiCreateUser##acCreateDefaultCollections##msiCommit | msiRollback##msiRollback##nop

acVacuum(*arg1) | | delayExec(msiVacuum,*arg1) | nop

acCreateDefaultCollections | | acCreateUserZoneCollections | nop

acCreateUserZoneCollections | | acCreateCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acCreateCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop

acCreateCollByAdmin(*parColl,*childColl) | | msiCreateCollByAdmin(*parColl,*childColl) | nop

Administration Deletion Rules

acDeleteUser | | acDeleteDefaultCollections##msiDeleteUser##msiCommit | msiRollback##msiRollback##nop

acDeleteDefaultCollections | | acDeleteUserZoneCollections | nop

acDeleteUserZoneCollections | | acDeleteCollByAdmin(/$rodsZoneProxy/home,$otherUserName)##acDeleteCollByAdmin(/$rodsZoneProxy/trash/home,$otherUserName) | nop##nop

acDeleteCollByAdmin(*parColl,*childColl) | | msiDeleteCollByAdmin(*parColl,*childColl) | nop
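In the creation and deletion rules above, the micro-service set and the recovery set are both `##`-separated chains of equal length, so each action has a recovery step at the same position. The Python sketch below shows one plausible way to execute such a chain. The recovery semantics (on failure, run the recovery step of the failed action and of every earlier action, most recent first) are an assumption of this sketch, and `run_chain` and its `services` table are hypothetical stand-ins rather than iRODS code.

```python
from typing import Callable, Dict, List


def run_chain(
    actions: str,
    recovery: str,
    services: Dict[str, Callable[[], None]],
    log: List[str],
) -> bool:
    """Run '##'-separated actions; on failure, run the paired recovery steps."""
    action_names = actions.split("##")
    recovery_names = recovery.split("##")
    if len(action_names) != len(recovery_names):
        raise ValueError("each action needs a recovery entry at the same position")
    for index, name in enumerate(action_names):
        try:
            services[name]()
            log.append(name)
        except Exception:
            # Assumed semantics: undo the failed step and all earlier steps,
            # most recent first; 'nop' means there is nothing to undo.
            for undo in reversed(recovery_names[: index + 1]):
                if undo != "nop":
                    services[undo]()
                    log.append(undo)
            return False
    return True
```

With the `acCreateUser` chain, a failure in `acCreateDefaultCollections` would trigger `msiRollback` for that step and for `msiCreateUser`, and `msiCommit` would never run.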

Data Manipulation Rules

Rule for pre-processing on storage use

acSetRescSchemeForCreate | | msiSetDefaultResc(demoResc,noForce)##msiSetRescSortScheme(random)##msiSetRescSortScheme(byRescType) | nop##nop##nop

Rule for pre-processing on data reads

acPreprocForDataObjOpen | | msiSortDataObj(random) | nop

Rule for post processing data writes

acPostProcForPut | | nop | nop

acPostProcForCopy | | nop | nop

Rule for setting number of threads for parallel I/O

acSetNumThreads | | msiSetNumThreads(default,default,default) | nop

Rule for data deletion policy setting

acDataDeletePolicy | | nop | nop

Planned Development

• Implement the rules and micro-services needed for the listed ERA capabilities
  • Have identified 174 micro-services
    • Data manipulation
    • Structured information manipulation
  • Have identified 212 persistent state attributes

• Implement the rules and micro-services needed to validate assessment criteria for trusted digital repositories
  • Have identified 176 rules

ERA Capability Categories

• Accession
• Arrangement
• Description
• Preservation
• Access
• Disposition
• Subscription
• Notification
• Task queuing
• Transformative migration
• Display transformation
• Automated client specification
• System performance and failure reports

Summary of Mapping ERA Capabilities to Management Rules

• ERA integrates capabilities of multiple systems
  • PAWN submission pipeline - 34 operations
  • Cheshire indexing system - 13 operations
  • Kepler workflow - 53 operations
  • iRODS data management - 597 operations
  • Operations facility - the remaining capabilities

• The 597 operations are executed by 174 generic rules

• The analysis identified five types of metadata attributes:
  • Collection metadata - 11 attributes
  • File metadata - 123 attributes
  • User metadata - 38 attributes
  • Resource metadata - 9 attributes
  • Rule metadata - 32 attributes

Example ERA Capabilities

• Record manipulation
  • List files
  • Display file (template)
  • Format file
  • Delete file
  • Delete file authorized
  • Delete file copies
  • Delete file versions
  • Erase file
  • Replace file
  • Set file version
  • Create soft link
  • Replicate file
  • Synchronize replicas
  • Physically move file
  • Annotate file
  • Access URL
  • Check vault
  • Monitor space used
  • Output file
  • Register file
  • Regenerate system metadata
  • Set number of items per display page

• Structured information
  • DIP format template
  • Disposition agreement format template
  • Disposition action format template
  • Physical location report template
  • Inventory report template
  • Data movement summary report template
  • Access report template
  • File migration report template
  • Document internal access control template
  • AIP format template
  • Transfer format template
  • Access review determination rule template
  • Access review determination report template
  • Validate access classification rule template
  • File transfer discrepancy report template
  • Notification review report template
  • Redaction rule template
  • Search display template
  • File display template (file type)
  • Format conversion format template
  • Workbench display template
  • Request help format template

Theory of Data Management

• Characterization
  • Persistent name spaces
  • Operations that are performed upon the persistent name spaces
  • Changes to the persistent state information associated with each persistent name space that occur for each operation
  • Transformations that are made to the records on each operation

• Completeness
  • Set of operations is complete, enabling the decomposition of every data management process onto the operation set.
  • Data management policies are complete, enabling the validation of all data assessment criteria.
  • Persistent state information is complete, enabling the validation of authenticity and integrity and management policies.

• Assertion
  • If the operations are reversible, then a future data management environment can recreate a record in its original form, maintain authenticity and integrity, support access, and display the record.
  • Such a system would allow records to be migrated between independent implementations of data management environments, while maintaining authenticity and integrity.
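The reversibility assertion can be illustrated with a small Python sketch. The two transformations used here (zlib compression and Base64 encoding) are arbitrary stand-ins chosen because each has an exact inverse in the standard library, and `OPERATIONS`, `apply_operations`, and `recreate_original` are hypothetical names. The recorded history plays the role of persistent state information: given the history and the inverse of each operation, the original record can be recreated.

```python
import base64
import zlib
from typing import Callable, Dict, List, Tuple

# Each operation is stored with its inverse (forward, inverse).
OPERATIONS: Dict[str, Tuple[Callable[[bytes], bytes], Callable[[bytes], bytes]]] = {
    "compress": (zlib.compress, zlib.decompress),
    "base64": (base64.b64encode, base64.b64decode),
}


def apply_operations(record: bytes, names: List[str]) -> bytes:
    """Apply the named transformations in order."""
    for name in names:
        forward, _ = OPERATIONS[name]
        record = forward(record)
    return record


def recreate_original(stored: bytes, history: List[str]) -> bytes:
    """Undo the recorded transformations, most recent first."""
    for name in reversed(history):
        _, inverse = OPERATIONS[name]
        stored = inverse(stored)
    return stored
```

If any operation in the history lacked an inverse, or the history itself were incomplete, the original form could not be recovered, which is why the slide ties the assertion to both reversibility and completeness of persistent state.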