Workflows for Digital Preservation and Curation Workshop Open Repositories 2012
Workflows for Digital Preservation and Curation Workshop
Open Repositories 2012
Stacy Kowalczyk, Beth Plale
Kavitha Chandrasekar, Yiming Sun
Agenda
• Introduction to Digital Curation
• Workflow Systems Overview
• Workflows for Digital Curation
• Break
• Implementing Workflows in Trident
• Modifying a Workflow
• Create a new Workflow
• Creating Components
• Wrap up
7/10/12
Acknowledgements
• This workshop was made possible through a generous grant by Microsoft Research
• And by the Data to Insight Center of Indiana University’s Pervasive Technology Institute
• Quan Zhou, Ph.D. student and developer, for his help with developing components, workflows, and documentation
Introduction to Digital Curation
• Defining curation
• Infrastructure for curation
• Curating the files
• Curating the object
Defining Curation
Digital curation involves maintaining, preserving and adding value to digital research data throughout its lifecycle.
The active management of research data reduces threats to their long-term research value and mitigates the risk of digital obsolescence. Meanwhile, curated data in trusted digital repositories may be shared among the wider … research community.
As well as reducing duplication of effort in research data creation, curation enhances the long-term value of existing data by making it available for further high quality research.
Digital Curation Centre
Curation Infrastructure
• Repository
• Public access
• Policies
• Processes
• Institutional support
Curating the Files
• Bitstream Integrity
  – Fixity
  – Duplicate copies
• File integrity
  – Format verification
  – Format validation
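As a rough illustration of format verification (checking that a file really is the format its extension claims), the sketch below compares a file's leading bytes with a few well-known signatures. The function name `verify_format` and the small signature table are assumptions made for this example; they are not part of Trident or JHOVE, and full format validation checks far more than the first few bytes.

```python
from pathlib import Path

# Well-known leading byte signatures ("magic numbers") for a few formats.
# This table is illustrative and deliberately small.
SIGNATURES = {
    ".tif": (b"II*\x00", b"MM\x00*"),  # little-endian and big-endian TIFF
    ".jpg": (b"\xff\xd8\xff",),         # JPEG start-of-image marker
    ".pdf": (b"%PDF-",),                # PDF header
}


def verify_format(path):
    """Return True if the file's first bytes match its extension.

    Hypothetical helper: returns None when the extension is not in
    SIGNATURES, so callers can tell "unknown" apart from "mismatch".
    """
    path = Path(path)
    expected = SIGNATURES.get(path.suffix.lower())
    if expected is None:
        return None
    with path.open("rb") as handle:
        header = handle.read(8)
    return any(header.startswith(sig) for sig in expected)
```

A file named `scan.jpg` that actually begins with a PDF header would return `False` here, which is the kind of mismatch an ingest workflow would flag before deposit.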
File Formats
• Durability
  – Transparency
  – Documentation
  – Ubiquity
  – Renderability
  – Longevity
Format Choices
• Master files for preservation
  – Highest quality
  – Highest fidelity
  – Lossless
• Derivative files for active use and delivery
  – Smallest possible for user needs
  – Fast delivery
  – Easy to use format
Curating the Object
• Context
  – Relationships between files
  – Technical metadata
  – Intellectual metadata
• To Metadata
  – Implicit/explicit context
Curation Activities
• Ongoing verification
  – File integrity
  – Object integrity
• Metadata management
• Management of obsolescence
  – Hardware
  – Software
  – Formats
  – Documentation
Workflow Systems
• Purpose of workflow systems
• Types of workflow systems
• Trident Workflow Workbench
Why Workflow Systems
• Repetitive and mundane activities simplified
• Facilitates and enforces best practices
• Enables efficient scheduling
• Machinery for coordinating the execution of services and linking together resources
• Facilitates outreach to researchers for direct deposit and automatic curation
Types of Workflow Systems
• Kepler
• BPEL
• Ptolemy II
• Triana
• Taverna
Trident
• Open source project
• Based on Windows Workflow Foundation classes
• Supported by Microsoft Research and academic researchers
• Integrates with myExperiment
• Well accepted in the research community
  – Well over 100 peer-reviewed papers and white papers were found through one scholarly aggregation service
Trident Components
• Trident Management Studio
• Trident Workflow Composer
• Trident Workflow Application
• Microsoft SQL Server
• Trident Silverlight client for web execution of workflows
• Microsoft Visual Studio
  – C# development environment
[Architecture diagram; labels as extracted]
• Design: Visual Workflow Composer
• Trident Registry
• Workflow Packages (domain specific)
• Trident Runtime Services
• Windows Workflow Foundation
• .NET 4.0
• Provenance
• Monitoring
• Workflow Scheduling Service
• Admin: Admin Console, Workflow Monitor
• Community: Web Portal (search, launch, monitor), Workflow Launcher, Results Repository, Workflow Repository (myExperiment)
• Data Access Layer
• Data Object Model (data source abstraction layer)
• Data Storage Providers: SQL Server, Local XML store, …
Workflows for Curation
• Goals
  – Systematic and repeatable processes
  – Helps remove human errors
• Data Ingest
  – Integrity checks
  – Format normalization/derivative generation
  – Metadata creation
• Curation activities
  – Integrity checks
  – Format migration
  – Media migration
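One way to picture how a workflow makes ingest systematic and repeatable is an ordered list of steps applied identically to every object. The sketch below is only an analogy in Python: Trident workflows are composed visually on .NET, and the names `fixity_check`, `create_metadata`, `INGEST_STEPS`, and `run_workflow` are hypothetical, with each step reduced to a minimal stub.

```python
import hashlib
from pathlib import Path


def fixity_check(record):
    """Integrity check: record the MD5 digest of the file being ingested."""
    data = Path(record["path"]).read_bytes()
    record["md5"] = hashlib.md5(data).hexdigest()
    return record


def create_metadata(record):
    """Metadata creation: capture minimal technical metadata."""
    path = Path(record["path"])
    record["metadata"] = {"name": path.name, "size": path.stat().st_size}
    return record


# The ordered steps are the workflow; running the same list on every
# object is what makes the process repeatable.
INGEST_STEPS = [fixity_check, create_metadata]


def run_workflow(path, steps=INGEST_STEPS):
    """Apply each step in order and keep a log of which steps ran."""
    record = {"path": str(path), "log": []}
    for step in steps:
        record = step(record)
        record["log"].append(step.__name__)
    return record
```

Because the step list is data, adding a format migration or media migration step later means appending a function rather than changing how objects are processed by hand.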
Data Ingest Workflows
• Scenarios
  – Single part objects (individual images)
  – Multi-part objects (a book)
  – Multiple instantiations of a logical object (Word, PDF and PPT of a research paper)
  – Multiple multi-part objects (a group of letters)
  – Research data products (multiple files of various types)
  – Scientific workflow process
Single Part Objects Workflow
• Magic Lantern Slides
  – Individual files
  – Spreadsheet
[Workflow diagram; steps as extracted]
• Derivative Generation
• Format Validation and Verification
• Fixity Check
• Create Tech Metadata
• Create Intellectual Metadata
• Create Object Metadata
• Persistent Identification
• Deposit in Repository
• Image Quality Checks
Multi-part Object Workflow
• Comic Book
  – RIS
  – Set of .tif files
[Workflow diagram; steps as extracted]
• Create Tech Metadata
• Derivative Generation
• Format Validation and Verification
• Fixity Check
• Object Integrity
• Create Intellectual Metadata
• Create Object Metadata
• Persistent Identification
• Deposit in Repository
• Image Quality Checks
Multiple Instantiations of a Logical Object Workflow
• Papers
  – Each logical object per subdirectory
  – RIS, Word file and (perhaps) supplemental file
[Workflow diagram; steps as extracted]
• Format Normalization
• Format Validation and Verification
• Fixity Check
• Create Tech Metadata
• Create Intellectual Metadata
• Create Object Metadata
• Persistent Identification
• Deposit in Repository
• Derivative Generation
Multiple Multi-part Object Workflow
• Ball collection
  – RIS for collection and Inventory spreadsheet
  – Each logical object in separate subdirectory
[Workflow diagram; steps as extracted]
• Create Tech Metadata
• Derivative Generation
• Format Validation and Verification
• Fixity Check
• Object Integrity
• Create Intellectual Metadata
• Create Object Metadata
• Persistent Identification
• Deposit in Repository
• Image Quality Checks
• Collection Integrity
• Create Collection Metadata
Research Data Products
• Vortex
  – Each subdirectory is an experiment with FGDC metadata
[Workflow diagram; steps as extracted]
• Compress Data
• Fixity Check
• Create Intellectual Metadata
• Create Object Metadata
• Persistent Identification
• Deposit in Repository
Workflow Components
• Format Conversions (for normalization and derivative generation)
  – .xlsx to .csv
  – .docx to .pdf
  – .ppt to .pdf
  – .tif to .jpg
  – Zipping on demand
  – Image (.tif or .jpg) to .pdf
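Of the conversions listed, zipping on demand is the easiest to sketch with a standard library alone. The example below is a hypothetical Python stand-in, not the workshop's Trident component; the function name `zip_directory` and its parameters are assumptions for illustration.

```python
import zipfile
from pathlib import Path


def zip_directory(source_dir, archive_path):
    """Compress every file under source_dir into one ZIP archive.

    Paths inside the archive are stored relative to source_dir so the
    object's directory layout survives the round trip.
    """
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_path in sorted(source_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in source_dir.
            if not file_path.is_file():
                continue
            if file_path.resolve() == archive_path.resolve():
                continue
            arcname = file_path.relative_to(source_dir).as_posix()
            archive.write(file_path, arcname)
    return archive_path
```

Keeping relative paths matters for multi-part objects, where the relationship between files is part of the context being preserved.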
Workflow Components 2
• Context creation
  – MIX data generator and validator
  – METS data generator and validator
• Data Integrity
  – MD5 checksum generator
  – MD5 checksum validator
  – JHOVE for format verification and validation
  – Group validation (for object integrity)
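The generator and validator pair can be sketched as two small functions: one records a digest per file at ingest, the other recomputes digests later and reports differences. This is a minimal Python illustration under assumed names (`md5_of_file`, `generate_checksums`, `validate_checksums`); it does not reproduce the Trident components themselves.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def generate_checksums(directory):
    """Generator step: map each file name in a directory to its MD5 digest."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return {p.name: md5_of_file(p) for p in files}


def validate_checksums(directory, expected):
    """Validator step: return the names whose file is missing or whose
    current digest differs from the stored one."""
    failures = []
    for name, stored in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != stored:
            failures.append(name)
    return failures
```

An empty list from `validate_checksums` means every stored digest still matches; any name returned points to a file that changed or disappeared since the checksums were generated.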
Post Deposit Curation Workflow
• Scenarios
  – Fixity verification
  – Format normalization
  – New or additional derivative generation
  – Media migration
  – Persistent identifier updates
  – Metadata updates
Workflows in Trident
Executing Workflows
• Individual object ingest
• Multipart object ingest
• Multiple multipart object ingest
• Multiple instantiations of a single logical object
• Research data ingest
• Scientific workflow
• Fixity check curation workflow
Implementing Workflows in Trident
• Launch the Remote Desktop application
• User: AMAZONA-JJOAL14\oruser
• PWD: TridentOR12!!
• Computer IP addresses are on the slip of paper being passed out now
Trident Workflow Composer
Participant Exercises
Modifying Workflows
• Add components to existing workflows
• Select the Individual Ingest Workflow
  – Add DOI component
    • Before the METS generator component
    • Make the connections
• Select the Group Ingest Workflow Comic
  – Add the METS generation component
    • After the last component in the main line
    • Make the connections
Simple Curation Workflow Creation
• Create a workflow for a simple curation process: validate MD5 checksums
  – Define a directory of image files
  – Define a METS file
  – Define an output location
  – Link the MD5 checksum validation component
  – Link the MD5 checksum report component
  – Save and execute the workflow
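The exercise above wires three inputs (an image directory, a METS file, an output location) into a validation step and a report step. The Python sketch below mirrors that shape outside Trident. It assumes, since the workshop's METS layout is not shown, that each METS `file` element carries `CHECKSUM` and `CHECKSUMTYPE="MD5"` attributes and that its `FLocat` child holds an `xlink:href` relative to the image directory. The function name and the tab-separated report format are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from pathlib import Path

METS_NS = "{http://www.loc.gov/METS/}"
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def validate_against_mets(image_dir, mets_path, report_path):
    """Compare MD5 checksums recorded in a METS file with files on disk.

    Writes one "STATUS<TAB>name" line per file to report_path and
    returns the same results as a list of (status, name) tuples.
    """
    results = []
    root = ET.parse(mets_path).getroot()
    for file_el in root.iter(METS_NS + "file"):
        if file_el.get("CHECKSUMTYPE") != "MD5":
            continue  # this sketch only handles MD5 entries
        locat = file_el.find(METS_NS + "FLocat")
        if locat is None or locat.get(XLINK_HREF) is None:
            continue  # no file location recorded, nothing to check
        name = locat.get(XLINK_HREF)
        path = Path(image_dir) / name
        stored = file_el.get("CHECKSUM", "").lower()
        if not path.is_file():
            status = "MISSING"
        elif hashlib.md5(path.read_bytes()).hexdigest() == stored:
            status = "OK"
        else:
            status = "MISMATCH"
        results.append((status, name))
    lines = [f"{status}\t{name}\n" for status, name in results]
    Path(report_path).write_text("".join(lines), encoding="utf-8")
    return results
```

Separating the comparison from the report writing, as the two Trident components in the exercise do, would be a natural next refactoring; they are combined here only to keep the sketch short.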
Creating Components
• Exercise:
  – Create a new Trident workflow component
  – Implement the MARCXML to MODS Stylesheet http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3-4.xsl
  – Kavitha Chandrasekar will demonstrate the process
Wrap Up
• Thumb drives
• Trident CodePlex site
• Trident listserv
• Contributing to Trident
• Workshop Evaluation Form
• Ongoing conversation
Contacts for Further Discussion
• Trident CodePlex site: http://tridentworkflow.codeplex.com/
• Trident Listserv: [email protected]
• Stacy Kowalczyk: [email protected]
• Kavitha Chandrasekar: [email protected]
• Yiming Sun: [email protected]
• Quan Zhou: [email protected]