PAWN: A Novel Ingestion Workflow Technology for Scientific Data
![Page 1: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/1.jpg)
PAWN: A Novel Ingestion Workflow Technology for Scientific Data
Mike Smorul, Joseph JaJa, Yang Wang, Mike McGann, and Fritz McCall
![Page 2: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/2.jpg)
Overall Principles
Distributed, secure ingestion
Use of web/grid technologies – platform independent
Minimal client-side requirements
Ease of integration with data grid systems
Designed to satisfy data integrity requirements of scientific collections and digital preservation
![Page 3: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/3.jpg)
Producer
[Diagram: Producer Management Interface, Producer data suppliers, Data Grid Gateway, Management Server]
![Page 4: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/4.jpg)
Producer
Provides data to a data grid based on a prior agreement.
Consists of a management/metadata server and an ingestion client.
Provides initial arrangement, context, and metadata.
![Page 5: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/5.jpg)
Data Grid - receiving
[Diagram: Producer 1, Producer 2, through Producer n; Scheduler; Bitstream Validation Service; Data Grid]
![Page 6: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/6.jpg)
Data Grid – receiving
Receives data from a Producer.
Validates bitstreams and metadata, and sends acknowledgement to Producer.
Arranges into collections and specifies optional publishing and preservation policy.
Publishes bitstreams into data grid.
![Page 7: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/7.jpg)
Data Grid – Long-term Stewardship
Implemented using grid technologies.
Uses the existing prototype NARA/UMD/SDSC site.
Automated replication and integrity checking.
Enforces access control and preservation policy.
![Page 8: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/8.jpg)
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package (SIP) creation.
3. Transfer of SIPs to Data Grid site.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into Data Grid.
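The five steps run strictly in order for each SIP, so a tracking record can be modeled as a small ordered state machine. The following is a minimal sketch only: the `Stage` and `IngestionRecord` names are hypothetical and are not taken from PAWN's implementation.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five workflow steps listed above, in order."""
    AGREEMENT = 1     # 1. Negotiate Submission Agreement
    SIP_CREATION = 2  # 2. Workflow initialization and SIP creation
    TRANSFER = 3      # 3. Transfer of SIPs to Data Grid site
    VALIDATION = 4    # 4. Validation of SIP transfer
    PUBLISHED = 5     # 5. Organization into collections, transfer into Data Grid


class IngestionRecord:
    """Hypothetical per-SIP tracking record (an assumption, not PAWN's data model)."""

    def __init__(self, sip_id):
        self.sip_id = sip_id
        self.completed = []  # stages finished so far, in order

    @property
    def done(self):
        return len(self.completed) == len(Stage)

    def complete(self, stage):
        """Mark a stage as finished; stages may not be skipped or repeated."""
        if self.done:
            raise ValueError("workflow already finished")
        expected = Stage(len(self.completed) + 1)
        if stage is not expected:
            raise ValueError(f"expected {expected.name}, got {stage.name}")
        self.completed.append(stage)


record = IngestionRecord("sip-001")
for stage in Stage:
    record.complete(stage)
print(record.done)
```

Rejecting out-of-order calls keeps the record consistent with the rule that a SIP cannot be validated before it has been transferred.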
![Page 9: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/9.jpg)
Submission Agreement
Create a machine-actionable set of rules describing items.
Final Submission Agreement is composed of:
METS document for application defaults
METS Constraint document to limit METS form to submission parameters
![Page 10: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/10.jpg)
METS Overview
Provides a framework for linking structural organization of objects with metadata.
Using XML namespaces, metadata from various XML schemas (e.g., Dublin Core, FGDC) can be attached to objects.
Extensible for more complex metadata.
http://www.loc.gov/standards/mets/
![Page 11: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/11.jpg)
Sample METS Document
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/TR/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr>
    <agent ROLE="CREATOR">
      <name>toaster@hostname</name>
    </agent>
  </metsHdr>
  <fileSec>
    <fileGrp>
      <file ID="5" MIMETYPE="application/octet-stream" SIZE="67624" CREATED="2002-08-21T15:36:05"
            CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" MIMETYPE="application/octet-stream" SIZE="2517" CREATED="2002-09-06T17:06:07"
            CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:type="simple" xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi">
      <fptr FILEID="5"/>
      <fptr FILEID="7"/>
    </div>
  </structMap>
</mets>
[Slide callouts: Metadata Linking; Structural Organization]
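To illustrate how the file section and the structural map link together, here is a short sketch that reads a trimmed copy of the sample document with Python's standard `xml.etree.ElementTree`. The helper names `list_files` and `files_by_div` are made up for this example, and the trimmed `SAMPLE` string omits the header and some attributes.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/TR/xlink"  # namespace URI used by the sample document
NS = {"m": METS_NS}

# Trimmed copy of the sample METS document (header and some attributes omitted).
SAMPLE = """\
<mets xmlns="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/TR/xlink">
  <fileSec>
    <fileGrp>
      <file ID="5" SIZE="67624" CHECKSUM="2CE7D79E40BD6C6A65A6684B6FD3D08C" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:href="/nfshomes/toaster/iscsi/GFS-contrib-5.1.tar.gz"/>
      </file>
    </fileGrp>
    <fileGrp>
      <file ID="7" SIZE="2517" CHECKSUM="767185AA022180E701324C592E1C36E3" CHECKSUMTYPE="MD5">
        <FLocat LOCTYPE="URL" xlink:href="/nfshomes/toaster/iscsi/gfs.out"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div ID="3" LABEL="iscsi"><fptr FILEID="5"/><fptr FILEID="7"/></div>
  </structMap>
</mets>
"""


def list_files(root):
    """Map each file ID to its size, checksum, checksum type, and location."""
    files = {}
    for f in root.iterfind(".//m:fileSec/m:fileGrp/m:file", NS):
        loc = f.find("m:FLocat", NS)
        files[f.get("ID")] = {
            "size": int(f.get("SIZE")),
            "checksum": f.get("CHECKSUM"),
            "checksum_type": f.get("CHECKSUMTYPE"),
            "href": loc.get(f"{{{XLINK_NS}}}href"),
        }
    return files


def files_by_div(root):
    """Map each structMap div label to the file IDs its fptr elements reference."""
    return {
        div.get("LABEL"): [p.get("FILEID") for p in div.findall("m:fptr", NS)]
        for div in root.iterfind(".//m:structMap//m:div", NS)
    }


root = ET.fromstring(SAMPLE)
print(list_files(root))
print(files_by_div(root))
```

The `fptr` elements carry only file IDs, so the structural organization stays independent of where the bitstreams are physically stored.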
![Page 12: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/12.jpg)
Why METS Constraints?
METS doesn’t provide a way to create machine-interpretable rules describing a collection, e.g., allow only TIFF files in certain structural areas.
METS profiles allow for developer-interpretable rules, not machine-interpretable ones.
![Page 13: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/13.jpg)
METS Constraints
Allows structural, metadata, and file constraints.
Structural Constraints: Restrict child divs and restrict pointers to div, file, and other METS documents.
File Constraints: Restrict files by MIME type or validation tests.
Metadata Constraints: Restrict allowed metadata schema.
![Page 14: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/14.jpg)
METS Constraints - Template
<?xml version="1.0" encoding="UTF-8"?>
<mets …. >
  <!-- validation test section, referenced in the constraints document -->
  <amdSec>
    <techMD ID="xmltest">
      <mdWrap MDTYPE="OTHER">
        <xmlData>
          <val:validation NAME="xmltext" DESCRIPTION="Test for valid xml documents" MIMETYPE="text/xml">
            <val:valgrp required="true">
              <val:valtest name="gif" required="true">
                <val:description>generic gif test for any file</val:description>
              </val:valtest>
            </val:valgrp>
          </val:validation>
        </xmlData>
      </mdWrap>
    </techMD>
  </amdSec>
  <!-- base div structure to use for all clients -->
  <structMap>
    <div ID="ID1" LABEL="Research &amp; Development Records">
      <div ID="ID1.1" LABEL="Research &amp; Development Project Records">
        <div ID="ID1.1.1" LABEL="R&amp;D Project Case Files"/>
        <div ID="ID1.1.2" LABEL="R&amp;D Record Series"/>
      </div>
    </div>
  </structMap>
</mets>
![Page 15: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/15.jpg)
METS Constraints - Rules
<?xml version="1.0" encoding="UTF-8"?>
<metsconstraint …>
  <filegrp ID="FILE1" NAME="Text Document">
    <!-- Files can be identified either by MIMETYPE, or TESTID in skeleton METS document or both -->
    <file NAME="html document" MIMETYPE="text/html"/>
    <file TESTID="xmltext" NAME="xml document" MIMETYPE="text/xml"/>
  </filegrp>
  <!-- Apply rules to predefined div's and link to required file/metadata tests above -->
  <divrule DIVID="ID1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1" RESTRICTDIV="true" RESTRICTFTPR="true" RESTRICTMPTR="true"/>
  <divrule DIVID="ID1.1.1" RESTRICTMPTR="true">
    <filetype FILEGROUPID="FILE1"/>
  </divrule>
  <divrule DIVID="ID1.1.2" RESTRICTMPTR="true"/>
</metsconstraint>
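As a rough illustration of what "machine-interpretable" buys, the sketch below checks files against the file-type rule attached to div `ID1.1.1`. It rests on assumptions: the rules are hand-copied into Python dictionaries rather than parsed from the constraints document, the meaning assigned to `filetype` (files in that div must match a MIME type from the referenced file group) is inferred from the slides, and the `RESTRICT*` attributes are not modeled at all.

```python
# Hypothetical in-memory form of the rules above; this is an assumption for
# illustration, not PAWN's actual parser or data model.
FILE_GROUPS = {"FILE1": {"text/html", "text/xml"}}  # filegrp ID -> allowed MIME types
DIV_FILETYPES = {"ID1.1.1": ["FILE1"]}              # divrule DIVID -> referenced file groups


def violations(div_files):
    """Return (div_id, file_name, mime) for files outside a div's allowed groups.

    div_files maps a div ID to a list of (file_name, mime_type) pairs.
    Divs without a file-type rule are not checked.
    """
    bad = []
    for div_id, files in div_files.items():
        groups = DIV_FILETYPES.get(div_id)
        if groups is None:
            continue  # no file-type rule attached to this div
        allowed = set().union(*(FILE_GROUPS[g] for g in groups))
        for name, mime in files:
            if mime not in allowed:
                bad.append((div_id, name, mime))
    return bad


submission = {
    "ID1.1.1": [("report.html", "text/html"), ("scan.tif", "image/tiff")],
    "ID1.1.2": [("data.bin", "application/octet-stream")],
}
print(violations(submission))
```

Here `scan.tif` is flagged because `image/tiff` is not in file group `FILE1`, while `ID1.1.2` has no file-type rule and so passes unchecked.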
![Page 16: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/16.jpg)
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package (SIP) creation.
3. Transfer of SIPs to Data Grid site.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into Data Grid.
![Page 17: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/17.jpg)
Initialize Ingestion workflow
Instantiate Producer management server to track registered objects
Establish a working trust relationship with the Data Grid
Issue clients.
![Page 18: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/18.jpg)
Create SIP
Each client registers objects stored locally with producer management server
Register file types, validation tests, etc.
Client follows rules in Submission Agreement
Producer-wide agents can arrange registered objects to give a broader context
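Registering a locally stored object amounts to recording its size, checksum, type, and location in METS form. The sketch below builds a METS `file` element with the same attributes as the sample document. It is only an illustration: `describe_file` is a hypothetical helper, the file's modification time stands in for `CREATED`, and the call that would send the element to the producer management server is left out because that interface is not described here.

```python
import hashlib
import mimetypes
import os
import xml.etree.ElementTree as ET
from datetime import datetime

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/TR/xlink"  # namespace URI used by the sample document


def describe_file(path, file_id):
    """Build a METS <file> element describing one locally stored object."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so large scientific files are not loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            md5.update(chunk)
    stat = os.stat(path)
    # Assumption: modification time is used in place of a true creation time.
    created = datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%dT%H:%M:%S")
    file_el = ET.Element(f"{{{METS_NS}}}file", {
        "ID": file_id,
        "MIMETYPE": mimetypes.guess_type(path)[0] or "application/octet-stream",
        "SIZE": str(stat.st_size),
        "CREATED": created,
        "CHECKSUM": md5.hexdigest().upper(),
        "CHECKSUMTYPE": "MD5",
    })
    ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {
        "LOCTYPE": "URL",
        f"{{{XLINK_NS}}}type": "simple",
        f"{{{XLINK_NS}}}href": path,
    })
    return file_el
```

Calling `ET.tostring(describe_file(path, "F1"))` on an existing file yields an element that can be placed inside a `fileGrp` of the SIP.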
![Page 19: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/19.jpg)
SIP Example
The submission package is designed to contain a self-describing, self-validating set of metadata.
OAIS Information Package:
· Content Information: Physical Object, Representation Information
· Preservation Description Information: Provenance, Fixity, Reference, Context
· Packaging Information
· Descriptive Information
![Page 20: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/20.jpg)
Client Interface
![Page 21: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/21.jpg)
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package (SIP) creation.
3. Transfer of SIPs to Data Grid site.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into Data Grid.
![Page 22: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/22.jpg)
Transfer SIP to Data Grid
Retrieve previously registered SIP from producer management server
Authenticate to data grid
Update tracking information with new location of files in data grid
Data Grid acknowledges transfer completion to producer management server
![Page 23: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/23.jpg)
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package (SIP) creation.
3. Transfer of SIPs to Data Grid site.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into Data Grid.
![Page 24: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/24.jpg)
Validation of SIP transfer
Check incoming SIP against constraints documents.
Ensure object integrity by verifying checksums/cryptographic digests
Validate bitstreams against necessary tests
Record validation results
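The integrity step can be sketched as recomputing each digest on the receiving side and comparing it with the value recorded in the SIP. This is a minimal illustration under stated assumptions: `verify_fixity`, the manifest layout, and the `read_bytes` callback are hypothetical, and the "accept"/"reject" strings merely echo the accept/reject messages mentioned later in the deck.

```python
import hashlib


def verify_fixity(manifest, read_bytes):
    """Compare received bitstreams against the checksums recorded in the SIP.

    manifest:   file ID -> {"checksum": hex string, "checksum_type": e.g. "MD5",
                            "size": length in bytes}
    read_bytes: callable taking a file ID and returning the received bytes
    Returns a dict mapping each file ID to "accept" or "reject".
    """
    results = {}
    for file_id, meta in manifest.items():
        data = read_bytes(file_id)
        digest = hashlib.new(meta["checksum_type"].lower(), data).hexdigest().upper()
        ok = len(data) == meta["size"] and digest == meta["checksum"].upper()
        results[file_id] = "accept" if ok else "reject"
    return results


# Usage with an in-memory stand-in for the received files.
received = {"5": b"hello", "7": b"corrupted"}
manifest = {
    "5": {"checksum": hashlib.md5(b"hello").hexdigest(), "checksum_type": "MD5", "size": 5},
    "7": {"checksum": hashlib.md5(b"original").hexdigest(), "checksum_type": "MD5", "size": 8},
}
print(verify_fixity(manifest, received.__getitem__))
```

Checking the size before trusting the digest comparison is cheap and catches truncated transfers early; the per-file results are what would be recorded as validation results.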
![Page 25: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/25.jpg)
Ingestion Workflow
1. Negotiate Submission Agreement.
2. Workflow Initialization and Submission Information Package (SIP) creation.
3. Transfer of SIPs to Data Grid site.
4. Validation of SIP transfer.
5. Organization of data into collections and transfer into Data Grid.
![Page 26: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/26.jpg)
Final transfer to Data Grid
Transfer objects to Data Grid
Update tracking information with new location in Data Grid
Transfer log of data activity into data grid
Return accept/reject messages to producer metadata server
![Page 27: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/27.jpg)
Component Overview
[Diagram. Components: Producer data suppliers, Producer Management Interface, Data Grid Management Interface, Bitstream Validation Service, Data Grid. Labeled interactions: metadata registration/retrieval, SIP transfer, CRL check, success/failure notification of ingestion.]
![Page 28: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/28.jpg)
Producer Components
Database to track registered objects
Certificate Authority management
Web service for receiving side security callback
Management server supplies web service interfaces to ingestion clients and management operations.
Clients are designed to be standalone, with security certificates issued by producer
![Page 29: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/29.jpg)
Receiving Components
Receiving servers validate connecting clients and validate SIPs
Validation Services are simple web service calls.
Abstract I/O layer into data grid.
![Page 30: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/30.jpg)
Recap
Implemented using web technologies
Architecture independent
XML based metadata
METS based SIPs
Add-on constraints describing Submission Agreement
Target release dates:
Beta: April
Release: June/July
![Page 31: PAWN: A Novel Ingestion Workflow Technology for Scientific Data](https://reader036.fdocuments.us/reader036/viewer/2022062520/56815915550346895dc63ffe/html5/thumbnails/31.jpg)
More Information
ADAPT website: http://www.umiacs.umd.edu/research/adapt
Papers:
Scalable, Reliable Marshalling and Organization of Distributed Large Scale Data Onto Enterprise Storage Environments
PAWN: Producer - Archive Workflow Network in Support of Digital Preservation