Workflows for Digital Curation and Preservation Stacy Kowalczyk PASIG Dublin 2012 October 17, 2012.
Workflows for Digital Curation and Preservation
Stacy Kowalczyk
PASIG Dublin 2012
October 17, 2012
Topics
• Goals
• A Very Brief Introduction to Workflow Systems
• Components for Curation
• Workflow Scenarios
• Future Work
Workflows for Curation
Goals
– Increase capacity and scalability of curation efforts
– Develop distributed curation processes
– Lower costs of curation activities
– Improve quality with systematic and repeatable processes
– Reduce human errors
Why Workflow Systems
• Simplify repetitive and mundane activities
• Facilitate and enforce best practices
• Enable efficient scheduling
• Provide machinery for coordinating the execution of services and linking together resources
• Facilitate outreach to researchers for direct deposit and automatic curation
Types of Workflow Systems
• Kepler
• BPEL
• Ptolemy II
• Triana
• Taverna
Trident
• Open source project
• Based on Microsoft's Windows Workflow Foundation classes
• Supported by Microsoft Research and academic researchers
• Integrates with myExperiment
• Well accepted in the research community
– Well over 100 peer-reviewed papers and white papers were discovered from one scholarly aggregation service
• Graphical workflow design and execution interface
Trident Workflow Components
• Fixity
• Data Integrity
• Metadata Creation
• Format Normalization and Derivative Generation
• Persistent Identification
• Repository Integration
Fixity Components
• MD5 checksum generator
• MD5 checksum validator
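As an illustration of what the two fixity components do, here is a minimal Python sketch of an MD5 generator and validator. It is not the Trident component code; the function names `md5_checksum` and `validate_md5` are hypothetical and chosen only for this example.

```python
import hashlib
from pathlib import Path


def md5_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # Chunked reads keep memory use flat for large image files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_md5(path: Path, expected: str) -> bool:
    """Return True when the file's current digest matches the recorded one."""
    return md5_checksum(path) == expected.strip().lower()
```

The generator would run once at ingest and its output would be stored with the object; the validator can then be rerun at any later point to detect corruption.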
Data Integrity Components
• JHOVE for format verification and validation
• Group validation (for object integrity)
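The slides do not say how group validation is implemented. One plausible reading, sketched below in Python, is a check that every page of a multi-file object is present. The file naming convention (`<name>_<page number>.tif`) and the function `missing_pages` are assumptions made for this example only.

```python
import re
from pathlib import Path

# Hypothetical naming convention: comic_001.tif, comic_002.tif, ...
PAGE_PATTERN = re.compile(r"^(?P<stem>.+)_(?P<page>\d+)\.tif$")


def missing_pages(directory: Path, expected_count: int) -> list[int]:
    """Return the page numbers in 1..expected_count that have no .tif file."""
    found = set()
    for entry in directory.iterdir():
        match = PAGE_PATTERN.match(entry.name)
        if match:
            found.add(int(match.group("page")))
    return [n for n in range(1, expected_count + 1) if n not in found]
```

An empty result means the group is complete; any other result lists the gaps to resolve before deposit.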
Metadata Creation Components
• MIX data generator and validator
• METS data generator and validator
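To make the METS generator concrete, the sketch below builds a skeletal METS document with Python's standard library: a file section carrying MD5 checksums and a physical structure map that orders the files. It is an illustration under assumptions, not the Trident component: the function name `build_mets`, the `FILE0001`-style IDs, and the `USE="original"` grouping are choices made for this example, the output has not been validated against the METS schema, and MIX technical metadata is left out.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_mets(object_id: str, files: list[tuple[str, str]]) -> ET.Element:
    """Build a skeletal METS tree; `files` holds (relative path, MD5 hex) pairs."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "original"})
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
    root_div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")
    for index, (href, md5_hex) in enumerate(files, start=1):
        file_id = f"FILE{index:04d}"  # hypothetical ID scheme
        file_el = ET.SubElement(
            file_grp,
            f"{{{METS_NS}}}file",
            {"ID": file_id, "CHECKSUM": md5_hex, "CHECKSUMTYPE": "MD5"},
        )
        ET.SubElement(
            file_el,
            f"{{{METS_NS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href},
        )
        # One div per file, in sequence, pointing back at the file entry.
        page_div = ET.SubElement(root_div, f"{{{METS_NS}}}div", {"ORDER": str(index)})
        ET.SubElement(page_div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return mets
```

Calling `ET.tostring(build_mets(...), encoding="unicode")` serializes the tree for inclusion in a deposit package.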
Format Components
• Format conversions for normalization and derivative generation
– .xlsx to .csv
– .docx to .pdf
– .ppt to .pdf
– .tif to .jpg
– Zipping on demand
– Image (.tif or .jpg) to .pdf (single document and multipage)
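Of the conversions listed, zipping on demand is the simplest to sketch with the standard library alone. The Python example below is an assumption about how such a component could behave, not the actual implementation; `zip_directory` is a hypothetical name.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir: Path, archive_path: Path) -> list[str]:
    """Zip every file under source_dir and return the archive member names.

    archive_path should sit outside source_dir so the archive does not
    pick itself up while walking the tree.
    """
    members = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the source directory, with "/" separators.
                name = path.relative_to(source_dir).as_posix()
                archive.write(path, arcname=name)
                members.append(name)
    return members
```

The returned member list can be recorded alongside the archive's checksum as evidence of what was packaged.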
Repository Component
• Ingest to DSpace via SWORD
• DOI generator
Data Ingest Workflows
• Scenarios
– Single part objects (individual images)
– Multi-part objects (a book)
– Multiple instantiations of a logical object (Word, PDF and PPT of a research paper)
– Multiple multi-part objects (a group of letters)
– Research data products (multiple files of various types)
Single Part Objects
Single Part Objects Workflow
[Workflow diagram; steps shown: Derivative Generation, Format Validation and Verification, Fixity Check, Create Tech Metadata, Create Intellectual Metadata, Create Object Metadata, Persistent Identification, Deposit in Repository, Image Quality Checks]
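The workflow steps above are chained by the workflow system. As a rough analogy in Python (Trident itself is a graphical tool, so this is not how its workflows are authored), the sketch below runs named steps in sequence and stops at the first failure. The step order, the placeholder step bodies, and the `example:` identifier scheme are all assumptions for illustration.

```python
from typing import Callable

Step = Callable[[dict], dict]


def run_workflow(steps: list[tuple[str, Step]], item: dict) -> dict:
    """Run named steps in order; stop at the first step that raises."""
    state = dict(item)
    completed: list[str] = []
    for name, step in steps:
        try:
            state = step(state)
        except Exception as error:  # a failed check halts the remaining steps
            return {**state, "completed": completed, "failed": name, "error": str(error)}
        completed.append(name)
    return {**state, "completed": completed, "failed": None}


# Placeholder steps (hypothetical); real components would call the tools named in the slides.
def fixity_check(state: dict) -> dict:
    if state["md5"] != state["expected_md5"]:
        raise ValueError("checksum mismatch")
    return state


def persistent_identification(state: dict) -> dict:
    return {**state, "identifier": f"example:{state['name']}"}
```

Halting on the first failure reflects one of the stated goals: a file that fails its fixity check never reaches the deposit step.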
Single Part Objects Workflow
• For each original image
– MD5 checksum
– JHOVE validation and verification report
– ImageMagick report
– MIX file
• For each derivative file
– MD5 checksum
– DOI
• For each logical object
– DC record
– METS record
– SWORD package
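For the DC record produced per logical object, a simple Dublin Core serialization can be sketched as follows. This assumes the record uses the 15 elements of the Dublin Core Metadata Element Set wrapped in an `oai_dc` container; the slides do not specify the serialization, and `build_dc_record` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC_NS)
ET.register_namespace("oai_dc", OAI_DC_NS)

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def build_dc_record(fields: dict[str, list[str]]) -> str:
    """Serialize element -> values as an oai_dc XML string."""
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for element, values in fields.items():
        if element not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        for value in values:  # every DC element is repeatable
            ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")
```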
Multi-part Object Workflow
Multi-part Object Workflow
• Comic Book
– RIS
– Set of .tif files
[Workflow diagram; steps shown: Create Tech Metadata, Derivative Generation, Format Validation and Verification, Fixity Check, Object Integrity, Create Intellectual Metadata, Create Object Metadata, Persistent Identification, Deposit in Repository, Image Quality Checks]
Multi-part Object Workflow
• For each individual image file
– MD5 checksum
– JHOVE validation and verification report
– ImageMagick report
– MIX file
• For each derivative file
– MD5 checksum
• For the whole object
– DOI
– DC record
– METS record
• SWORD package
Multiple Instantiations of a Logical Object Workflow
Multiple Instantiations of a Logical Object Workflow
• Papers
– One logical object per subdirectory
– RIS, Word file and (perhaps) supplemental file
[Workflow diagram; steps shown: Format Normalization, Format Validation and Verification, Fixity Check, Create Intellectual Metadata, Create Object Metadata, Persistent Identification, Deposit in Repository, Derivative Generation]
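The RIS file supplied with each paper is presumably the source for the intellectual metadata step. Assuming "RIS" here means the RIS bibliographic citation format (two-character tags followed by two spaces, a hyphen, and a value, with `ER` closing each record), a minimal Python parser could look like the sketch below. The function name `parse_ris` is hypothetical, and continuation lines are simply skipped.

```python
import re

# A RIS line looks like "TI  - Some title": tag, two spaces, hyphen, space, value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text: str) -> list[dict[str, list[str]]]:
    """Parse RIS text into one tag -> values mapping per record."""
    records: list[dict[str, list[str]]] = []
    current: dict[str, list[str]] = {}
    for line in text.splitlines():
        match = RIS_LINE.match(line.rstrip())
        if not match:
            continue  # blank or continuation lines are ignored in this sketch
        tag, value = match.groups()
        if tag == "ER":  # end-of-record marker
            if current:
                records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    if current:  # tolerate a final record with no ER line
        records.append(current)
    return records
```

Tags such as `TI` (title), `AU` (author), and `PY` (publication year) could then be mapped onto Dublin Core elements for the object's DC record.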
Multiple Instantiations of a Logical Object Workflow
• For each original object
– MD5 checksum
– JHOVE report
• For each derivative object
– MD5 checksum
– Output from normalization process
– DOI for delivery object
• For the whole package
– METS file
– DC record
– SWORD package
Multiple Multi-part Object Workflow
Multiple Multi-part Object Workflow
• Ball collection
– RIS for collection and inventory spreadsheet
– Each logical object in a separate subdirectory
[Workflow diagram; steps shown: Create Tech Metadata, Derivative Generation, Format Validation and Verification, Fixity Check, Create Intellectual Metadata, Create Object Metadata, Persistent Identification, Deposit in Repository, Image Quality Checks, Collection Integrity, Create Collection Metadata]
Multiple Multi-part Object Workflow
• For each file
– MD5 checksum
– JHOVE report
– MIX file
– Scanning specifications
– Derivative files
• For each logical object
– Derivative object
– DC record
– METS file
– DOIs
• For the whole collection
– METS file
– DC record
Research Data Products
Research Data Products
• Vortex
– A subdirectory for each experiment
[Workflow diagram; steps shown: Compress Data, Fixity Check, Create Intellectual Metadata, Create Object Metadata, Persistent Identification, Deposit in Repository]
Research Data Products
• Outputs
– Zipped data file
– MD5 checksum
– FGDC metadata record
– Dublin Core record
– METS record
– SWORD package
Post Deposit Curation Workflow
• Scenarios
– Fixity verification
– Format normalization
– New or additional derivative generation
– Media migration
– Persistent identifier updates
– Metadata updates
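Of these post-deposit scenarios, fixity verification is the most mechanical. A minimal Python sketch follows, assuming the checksums recorded at ingest are available as a mapping of relative path to MD5 value. That manifest shape and the name `audit_fixity` are assumptions for this example; the slides do not describe how recorded checksums are stored.

```python
import hashlib
from pathlib import Path


def audit_fixity(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare recorded MD5 values with the files currently on disk."""
    report: dict[str, list[str]] = {"ok": [], "changed": [], "missing": []}
    for relative_path, recorded_md5 in sorted(manifest.items()):
        path = root / relative_path
        if not path.is_file():
            report["missing"].append(relative_path)
            continue
        digest = hashlib.md5()
        with path.open("rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        status = "ok" if digest.hexdigest() == recorded_md5.lower() else "changed"
        report[status].append(relative_path)
    return report
```

Run on a schedule, a report with non-empty `changed` or `missing` lists flags objects that need to be restored from another copy.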
Future Work
• Adding additional components
– EAD from spreadsheet
– MARC record support
– PREMIS support
• Testing in the lab
– Digital library scanning labs
– Research labs
– Integrating with a production repository
Acknowledgements
• This research was made possible through a generous grant from Microsoft Research
• And by the Data to Insight Center of Indiana University’s Pervasive Technology Institute
• Thanks to Kavitha Chandrashankar and Quan Zhou for their help with developing components, workflows, and documentation