D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and...

29
Wf4Ever: Advanced Workflow Preservation Technologies for Enhanced Science STREP FP7-ICT-2007-6 270192 Objective: ICT-2009.4.1b — “Advanced preservation scenarios” D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase II Deliverable Co-ordinator: Khalid Belhajjame Deliverable Co-ordinating Institution: University of Manchester Other Authors: Daniel Garijo, Graham Klyne, Raul Palma, Piotr Hołubowicz, Esteban García-Cuesta, and José Manuel Gómez-Pérez This deliverable describes the second phase of delivery of workflow lifecycle management com- ponents. It includes a description of the Research Object Model, which facilitates interoperation between components; the RO Manager command line tool; the Research Object Digital Library; the RO-enabled myExperiment; and a definition of models for workflow abstraction and indexa- tion. Document Identifier: Wf4Ever/2013/D2.2v2/1.1 Date due: July 31, 2013 Class Deliverable: Wf4Ever FP7-ICT-2007-6 270192 Submission date: July 31, 2013 Project start date December 1, 2010 Version: 1.1 Project duration: 3 years State: Draft Distribution: Public 2013 © Copyright lies with the respective authors and their institutions.

Transcript of D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and...

Page 1: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Wf4Ever: Advanced Workflow Preservation Technologies for Enhanced Science

STREP FP7-ICT-2007-6 270192

Objective: ICT-2009.4.1b — “Advanced preservation scenarios”

D2.2v2 Design, implementation and deployment ofworkflow lifecycle management components - Phase II

Deliverable Co-ordinator: Khalid Belhajjame

Deliverable Co-ordinating Institution: University of Manchester

Other Authors: Daniel Garijo, Graham Klyne, Raul Palma, Piotr Hołubowicz,Esteban García-Cuesta, and José Manuel Gómez-Pérez

This deliverable describes the second phase of delivery of workflow lifecycle management com-ponents. It includes a description of the Research Object Model, which facilitates interoperationbetween components; the RO Manager command line tool; the Research Object Digital Library;the RO-enabled myExperiment; and a definition of models for workflow abstraction and indexa-tion.

Document Identifier: Wf4Ever/2013/D2.2v2/1.1 Date due: July 31, 2013Class Deliverable: Wf4Ever FP7-ICT-2007-6 270192 Submission date: July 31, 2013Project start date December 1, 2010 Version: 1.1Project duration: 3 years State: Draft

Distribution: Public

2013 © Copyright lies with the respective authors and their institutions.

Page 2: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 2 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Wf4Ever Consortium

This document is part of the Wf4Ever research project funded by the IST Programme of the Commissionof the European Communities by the grant number FP7-ICT-2007-6 270192. The following partners areinvolved in the project:

Intelligent Software Components S.A. (ISOCO) –Coordinator

University of Manchester (UNIMAN)

Edificio Testa, Avda. del Partenón 16-18, 1o, 7a School of Computer ScienceCampo de las Naciones, 28042 Madrid Oxford Road, Manchester M13 9PLSpain United KingdomContact person: Jose Manuel Gómez Pérez Contact person: Carole GobleE-mail address: [email protected] E-mail address: [email protected] Politécnica de Madrid (UPM) Instytut Chemii Bioorganicznej PAN - Poznan Su-

percomputing and Netowrking Center (PSNC)Departamento de Inteligencia Artificial, Facultad Network Services Departmentde Informática. 28660 Boadilla del Monte. Madrid Ul Z. Noskowskiego 12-14 61704 PoznanSpain PolandContact person: Oscar Corcho Contact person: Raul PalmaE-mail address: [email protected] E-mail address: [email protected] of Oxford (OXF) Instituto de Astrofísica de Andalucía (IAA)Department of Zoology Dpto. Astronomía Extragaláctica.South Parks Road, Oxford OX1 3PS Glorieta de la Astronomía s/n, 18008 GranadaUnited Kingdom SpainContact person: Jun Zhao, David De Roure Contact person: Lourdes Verdes-MontenegroE-mail address: [email protected] E-mail address: [email protected]@oerc.ox.ac.ukLeiden University Medical Centre (LUMC)Department of Human GeneticsAlbinusdreef 2, 2333 ZA LeidenThe NetherlandsContact person: Marco RoosE-mail address: [email protected]

Page 3: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 3 of 29

Work package participants

The following partners have taken an active part in the work leading to the elaboration of this document, evenif they might not have directly contributed to the writing of this document or its parts:

• iSOCO

• OXF

• PSNC

• UNIMAN

• UPM

Change Log

Version Date Amended by Changes0.1 01-06-2013 Khalid Belhajjame Initial outline0.2 09-06-2013 Khalid Belhajjame Initial draft of Section 2 on the RO model0.3 12-06-2013 Graham Klyne Added the RO manager section0.4 14-06-2013 Daniel Garijo Added the workflow abstraction section0.5 14-06-2013 Khalid Belhajjame Added the introduction and a first draft of

the myExperiment section0.6 14-06-2013 Khalid Belhajjame Revised all sections and added the sum-

mary section0.7 21-06-2013 Khalid Belhajjame Addressed QA comments received from

Oscar Corcho on all sections, except Sec-tions 4 and 7

0.8 21-06-2013 Raul Palma Research Object Digital Library section0.8.1 21-06-2013 Piotr Hołubowicz Research Object Digital Library UML dia-

grams and alignment0.8.2 24-06-2013 Raul Palma Research Object Digital Library section

references and fixes0.9 05-07-2013 Esteban Carcía Added workflow indexing section1.0 08-07-2013 José Manuel Gómez-Pérez Added notion of RO management portfo-

lio and connected to D8.5v2. Some minorre-structuring

1.1 08-07-2013 José Manuel Gómez-Pérez Corrected some typos

2013 © Copyright lies with the respective authors and their institutions.

Page 4: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 4 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Executive Summary

This deliverable describes the second phase of delivery of workflow lifecycle management components.These components are focused around the Wf4Ever Research Object Model (RO Model), which providesdescriptions of workflow-centric ROs – aggregations of content. This model is used to structure and describeROs which are then stored and manipulated by the components of the Wf4Ever Toolkit.The RO Model provides a framework for describing aggregations of content along with annotations of theaggregated resources, a vocabulary for describing workflows, and a vocabulary for describing provenance.The application of the model within the project has not resulted any significant changes, suggesting thatthe model captures user requirement adequately and has reached a level of maturity. We provide herea description of the RO model. We also present the components developed for creating and managingResearch Objects: the RO Manager – the Research Object Digital Library.One of the main developments in the last year consists in incorporating Research Objects within the myExper-iment environment to allow scientists who already use myExperiment to create, share and reuse ResearchObjects. We discuss the efforts that went into this task, and show how myExperiment is using ResearchObject Digital Library as a back-end for storing and archiving Research Objects.We describe advanced management functions that we developed for abstracting and indexing workflows,with the aim of supporting the discovery and reuse of workflows. We present an ontology that we developedfor abstracting workflows using motifs that characterize data manipulation and transformation patterns. Wealso report on a solution that we developed for indexing workflows based on the services (processes) thatthey use.This deliverable should be read in tandem with D1.3v2 (Wf4Ever Architecture – Phase II), D1.4v2 (ReferenceWf4Ever Implementation – Phase II), D1.2v3 (Wf4Ever Sandbox – Phase III), D3.2v2 (Design, implementa-tion and deployment of Workflow Evolution, Sharing and Collaboration components – Phase II) and D4.2v2(Design, implementation and deployment of Workflow Integrity and Authenticity Maintenance components –Phase II) in order to provide a complete picture of the state of the Wf4Ever Phase II components.This deliverable supersedes D2.2v1, and can be read as a standalone document.

Page 5: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 5 of 29

Contents

1 Introduction 7

2 The Research Object Model 72.1 Research Object Core Ontology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 RO Extension Ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Research Object Management Tools 123.1 Research Object Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133.2 Research Object-Enabled myExperiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Research Object Digital Library 174.1 The interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174.2 The implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.3 RODL clients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5 Abstracting and Indexing Workflows 215.1 Workflow Abstraction using Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

5.1.1 Representing Motifs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225.1.2 Representing Workflows and Workflow Steps . . . . . . . . . . . . . . . . . . . . . . . 23

6 Indexing Workflows 246.1 Applications and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

6.1.1 Searching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256.1.2 Next step recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

7 Summary 27

Bibliography 28

2013 © Copyright lies with the respective authors and their institutions.

Page 6: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 6 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

List of Figures

1 The Abstract Model of Research Objects, and the Ontologies that Realize it . . . . . . . . . . 82 RO as an ORE aggregation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 The wfdesc ontology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 The wfprov ontology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 The roevo ontology extending PROV-O core terms. . . . . . . . . . . . . . . . . . . . . . . . . 116 Research Object management tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 RO Manager sequence diagram illustrating interactions with the user. . . . . . . . . . . . . . . 148 RO-enabled myExperiment (except RODL, the modules in the above figure belong to the my-

Experiment infrastructure). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 A Sequence diagram illustrating how myExperiment can be used to create Research Objects. . 1610 Research Objects Digital Library internal component diagram . . . . . . . . . . . . . . . . . . 1711 The sequence diagram for creating a Research Object in RODL . . . . . . . . . . . . . . . . . 1812 The sequence diagram for aggregating a new resource in a Research Object . . . . . . . . . . 1813 The sequence diagram for creating a snapshot or release from the RO Evolution API client

perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1914 The sequence diagram for preparing the snapshot of a Research Object . . . . . . . . . . . . 2015 The sequence diagram for finalizing the snapshot and making it immutable . . . . . . . . . . . 2016 The sequence diagram for storing Research Objects in dArceo . . . . . . . . . . . . . . . . . 2017 Sample motifs in a Taverna workflow for functional genomics. . . . . . . . . . . . . . . . . . . 2218 Diagram showing an overview of the class taxonomy of the motif OWL ontology. . . . . . . . . 2319 Subset of the annotations of the Taverna workflow shown in Figure 17 using the wfdesc model. 2420 Sorting the "Extract proteins using a gi - output as fasta file" example obtained from ProvBench 2521 Sequence diagram for workflows indexing, searching, and recommending next step processes. 26

List of Ontologies

1. RO ontology: http://purl.org/wf4ever/ro#

2. Wfdesc ontology: http://purl.org/wf4ever/wfdesc#

3. Wfprov ontology: http://purl.org/wf4ever/wfprov#

4. ROEvo ontology: http://purl.org/wf4ever/roevo#

Page 7: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29

1 Introduction

This deliverable describes Phase II of the design, implementation and deployment of the Wf4Ever compo-nents that will support workflow lifecycle management. The document should be read in tandem with otherMonth 32 deliverables, in particular D3.2v2 (Design, implementation and deployment of Workflow Evolu-tion, Sharing and Collaboration components – Phase II) [GC+13b] and D4.2v2 (Design, implementation anddeployment of Workflow Integrity and Authenticity Maintenance components – Phase II) [GC+13a] whichaddress complementary aspects of the overall wf4ver architecture and components. This deliverable super-sedes D2.2v1, and can be read as a standalone document.According to the Description of Work, This prototype will include the following functionalities: new versionsof the Research Object model and ontology network, advanced management functions (filtering, cluster-ing, etc.), playback functionalities for reproducibility, and workflow classification, indexing and explanationtechniques..These requirements are addressed in the following way:Section 2 presents the Research Object Model defined within Wf4Ever. Specifically, we present a family ofontologies for specifying Research Objects and their associated resources, e.g., workflow, workflow runs,etc.Section 3 presents the tools that we have developed for assisting users in creating and managing ResearchObjects. More specifically, Section 3.1 presents the Research Object Manager (RO Manager), a commandline tool for creating, displaying and manipulating Research Objects. Section 3.2 shows how the myExper-iment virtual environment [RGS09], was extended to allow end-users, who are not necessarily informationtechnology experts, to create, share, publish and curate Research Objects.Section 4 presents the Research Object Digital Library (RODL), which acts as a full-fledged back-end notonly for scientists but also for librarians.Section 5 presents the motif ontology that we developed for abstracting scientific workflows, and illustrateshow it has been used to document workflows. It then goes on to present a solution that we developed forindexing workflows based on the processes (steps) they are composed of, with the purpose of assisting usersin discovering workflows that are of interest to them.

2 The Research Object Model

The design of the Research Object model was informed by a systematic analysis of requirements expressedby scientists from the life sciences and astronomy fields1. The application of the model within the project hasnot resulted any significant changes, suggesting that the model captures user requirement adequately andhas reached a level of maturity. Figure 1 gives an overview of the abstract model of the Research Objectmodel, distinguishing between core and extended requirements. There are three core requirements that havebeen identified, namely a mechanism for uniquely identifying Research Objects, a means for aggregatingresources within a Research Object, and the ability to annotate the Research Object, its constituent resourcesand their relationships. The extended requirements highlight the need for specifying workflows (experiments),provenance traces of their executions, the evolution of a Research object over time, as well as mechanismsfor expressing their dependencies.As illustrated in Figure 1, we have realized the Research Object abstract model using a family of ontologies,which we will present in the rest of this section. Notice that we do not define ontologies for specifyingdependencies, investigations, components and studies, since there exist ontologies that implement theseaspects, e.g., the Dublin Core ontology2 provide the means for specifying dependencies, and the ISA model3

provides concepts for specifying investigations and studies.

1http://www.wf4ever-project.org/wiki/display/docs/User+Requirements2http://dublincore.org3http://www.isa-tools.org

2013 © Copyright lies with the respective authors and their institutions.

Page 8: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 8 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 1: The Abstract Model of Research Objects, and the Ontologies that Realize it

2.1 Research Object Core Ontology

The Core Research Object Ontology (Core RO) provides the minimum terms that are essential to the spec-ification of Research Objects. Specifically, it caters for two main requirements4 by providing a containerstructure that can be used by the scientists to bundle the resources and material relevant for their investi-gation, and by enabling annotations of such a container, its resources, as well as the relationships betweenresources thereby making the Research Object interpretable and reusable.To cater for the specification of aggregation structures, we built the Research Object Core Ontology upon thepopular ORE vocabulary5. ORE defines standards for the description and exchange of aggregations of Webresources. Figure 2 illustrates the main terms that constitute the Research Object Core Ontology, which wedescribe in what follows.

Figure 2: RO as an ORE aggregation.

• ro:ResearchObject6, represents an aggregation of resources. It is a sub-class of ore:Aggregationand acts as an entry point to the Research Object.

4http://www.wf4ever-project.org/wiki/display/docs/User+Requirements5www.openarchives.org/ore6The namespace of the Research Object Core Ontology ro is http://purl.org/net/wf4ever/ro#

Page 9: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 9 of 29

• ro:Resource, represents a resource that can be aggregated within a Research Object and is a sub-class of ore:AggregatedResource. A resource can be a Dataset, Paper, Software or Annotation.Typically, a ro:ResearchObject aggregates multiple ro:Resource, and this relationship is specifiedusing the property ore:aggregates.

• ro:Manifest, a sub-class of ore:ResourceMap, represents a resource that is used to describe aro:ResearchObject. It plays a similar role to the manifest in a JAR or a ZIP file, and is primarily usedto list the resources that are aggregated within the Research Object.

The second core requirement that the Research Object Core Ontology caters for is the descriptions of theResearch Object and its elements. We chose the Annotation Ontology (AO) release 2.0b2 [COC+11]. Toannotate Research Objects, we make use of the following three Annotation Ontology terms ao:Annotation7,which represents the annotation itself; ao:Target, which is used to specify the ro:Resource(s) orro:ResearchObject(s) subject to annotation; and ao:Body, which comprises a description of the target.In the case of Research Objects, we use annotations as a means for decorating a resource (or a set of re-sources) with metadata information. The body is specified in the form of a set of RDF statements, which canbe used to, e.g., specify the date of creation of the target or its relationship with other resources or ResearchObjects. Also, annotations can be provided for human consumption (e.g. a description of a hypothesis thatis tested by a workflow-based experiment), or for machine consumption (e.g. a structured description of theprovenance of results generated by a workflow run). Both kinds of annotations are accommodated usingAnnotation Ontology structures.

2.2 RO Extension Ontologies

We present in this section two extensions to the core Research Object ontology. The first specializes thekinds of resources that the Research Object can aggregate. In particular, we present extensions to specifymethod and experiments and the traces of their executions. The second kind of extension shows how spe-cific metadata information, describing the evolution of the Research Object over time, can be specified byspecializing the Research Object core ontology.

Specifying Workflows To describe workflow Research Objects the workflow description vocabularywfdesc8 defines several specific resources that are involved in a workflow specification. The choice of theseresources was performed by examining the commonalities between major data driven workflows [Shi07],namely Taverna9, Wings10 and Galaxy11.Figure 3 illustrates the terms that compose the wfdesc ontology. Using this ontology, a workflow is describedusing the following three main terms:

• wfdesc:Workflow refers to a network in which the nodes are processes and the edges represent datalinks. It is defined as a subclass of the Plan concept from the PROV-O ontology, which represents aset of actions or steps intended by one or more agents to achieve some goals [LSM+13].

• wfdesc:Process is used to describe a class of actions that when enacted give rise to process runs.Processes specify the software component (e.g., web service) responsible for undertaking those ac-tions.

• wfdesc:DataLink is used to encode the data dependencies between the processes that constitutea workflow. Specifically, a data link connects the output of a given process to the input of anotherprocess, specifying that the artifacts produced by the former are used to feed the latter.

7The namespace of ao is http://purl.org/ao/8The name space of wfdesc is http://purl.org/wf4ever/wfdesc#.9http://www.taverna.org.uk

10http://http://wings-workflows.org11http://galaxyproject.org

2013 © Copyright lies with the respective authors and their institutions.

Page 10: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 10 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 3: The wfdesc ontology.

Describing Experimental Provenance using the wfprov Ontology The wfprov ontology is used to de-scribe the provenance traces obtained by enacting workflows. It is defined as an extension to the ongoingW3C PROV standard ontology - PROV-O12.

Figure 4: The wfprov ontology.

Figure 4 illustrates the structure of the wfprov ontology and its alignments with the W3C PROV-O ontology. Aa workflow run (wfprov:WorkflowRun) represents the enactment of a given workflow. It is composed of a setof process runs (wfprov:ProcessRun), each representing the enactment of a process. A process run mayuse some artifacts (wfprov:Artifact) as input and generate others as output. A process run is enacted bya workflow engine (wfprov:WorkflowEngine), which can be seen as a PROV software agent.By chaining the usage and generation of artifact together, the wfprov ontology allows scientists to trace thelineage of workflow results. For example the user can identify the input artifacts that were used to feed thewokflow run (as a whole) to obtain a given output that was generated by the workflow run.

12Note that the wfprov is reported in the W3C PROV Working Group implementation report.

Page 11: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 11 of 29

Tracking Research Object Evolution using the roevo Ontology The roevo ontology is another extensionto the minimal core ontology for describing an important aspect of Research Objects, its life cycle. To trackthe life cycle of a Research Object, we need to describe its changes at different levels of granularity about theResearch Object as a whole and about the individual resources. Also, we want to provide sufficient details totrack the changes in order to roll back to a particular version or to quality control changes. Therefore, we needto describe when the change took place, who performed the change, and dependency relationships betweenthe changes. Change is closely related to the provenance of a particular version of a Research Object or aresource. A study of the latest PROV-O ontology shows that it indeed provides all the foundational informationelements for us to build the evolution ontology.

Figure 5: The roevo ontology extending PROV-O core terms.

Figure 5 illustrates the core concepts of this ontology and how it extends the PROV-O:

• To capture different status of a Research Object, depending on whether it is being edited, shared orarchived, we created three sub-classes of ro:ResearchObject. The roevo:LiveRO is a Research Ob-ject that is being edited to capture research findings during a live investigation. The roevo:ArchivedROcan be regarded as a production Research Object to be preserved and archived, such as one de-scribing findings published in an article, and it can no longer be changed. The roevo:SnapshotROrepresents a live Research Object at a particular time, e.g., a Research Object that was submitted fora review before publication.

• Both a snapshot of a live Research Object and an archived Research Object can be regarded asa versioned Research Object, i.e. a roevo:VersionableResource, Because it is a sub-class ofprov:Entity, we can reuse PROV-O properties to describe the provenance or changes of this en-tity, such as pointing to the activity leading to any of its changes, the source Research Object that itwas derived from, and the agent involved in its change.

• A change is a prov:Activity, which means that it has a start time, an end time, an input entityand a resulting entity. Also a change leading to a new Research Object can constitute a series ofchanges. Therefore, we have a composite roevo:ChangeSpecification activity, which has a numberof unit roevo:Changes. A unit change can be adding, removing or modifying a resource or a ResearchObject. But these different changes share the same pattern of taking an input entity and producing anoutput entity, which can all be nicely covered by properties from PROV-O.

As well as the above vocabularies, the Research Object model makes use of existing vocabularies, in par-ticular, FOAF13, DCTerms14, CITO15, and SIOC16 to provide Research Objects designers with the means

13http://xmlns.com/foaf/spec/14http://dublincore.org/documents/dcmi-terms/15http://vocab.ox.ac.uk/cito16http://sioc-project.org/ontology

2013 © Copyright lies with the respective authors and their institutions.

Page 12: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 12 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 6: Research Object management tools.

to express aspects such as the people who were involved in the creation of a Research Object, its cita-tion, as well as dependencies that the Research Object may have. For instance, we make use of the termdc:requires to specify that a the execution of a workflow requires other resources, e.g., plugins, credential,or specific execution environment.

3 Research Object Management Tools

This section presents the tools developed for assisting users in creating and managing Research Objects.These tools are aimed towards different types of target users and their needs and represent different levelsof deployment of Wf4Ever RO management capabilities according to such requirements. RO Manager,myExperiment, and RODL comprise our portfolio of RO management tools (see Figure 6). Details from theexploitation point of view about the portfolio will be given in the final exploitation plan (deliverable D8.5v2) ofWf4Ever’s project results.The Research Object Manager (RO Manager, Section 3.1) is a command line tool for creating, displayingand manipulating Research Objects, which incorporates the essential functionalities for RO management,especially by developers and in general a technically skilled audience used to working in a command-lineenvironment. myExperiment (Section 3.2) has been extended [RGS09] to allow end-users, who are notnecessarily information technology experts, to create, share, publish and curate workflow-centric ResearchObjects. The RO-enabled myExperiment incorporates a considerable amount of Wf4Ever’s developments,especially the RO Model and Research Object management capabilities for end users. Finally, the Re-search Object Digital Library (RODL, Section 4) acts as a full-fledged back-end not only for scientists butalso for librarians and potentially other communities interested in the aggregation of heterogeneous infor-mation sources with the rigor of digital libraries’ best practices. RODL provides a holistic approach to thepreservation of aggregated information sources and incorporates most of Wf4Ever’s capabilities dealing withRO management, collaboration, versioning and evolution, and quality management.

Page 13: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 13 of 29

3.1 Research Object Manager

The Research Object Manager (RO Manager) is a command line tool for creating, displaying and manipu-lating Research Objects. The RO Manager is complementary to RODL (see Section 4), in that it is primarilydesigned to support a user working with ROs in the host computer’s local file system, with the intention be-ing that the RODL and RO Manager can exchange ROs between them, using of the shared RO model andvocabularies. The RO Manager code base also includes the checklist evaluation functionality, described inD4.2 [GC+13a], which can be invoked using a command line or REST web interface.Experience has shown that a simple command-line tool can provide developers and users with early accessto functionality, and provide an opportunity to gather additional user feedback and requirements. RO Managerhas also been used in conjunction with built-in operating system functionality for scripting prototype toolchains for more complex operations involving Research Objects.The RO Manager allows users and developers to:

• Create local ROs;

• Add resources to an RO;

• Add annotations to an RO;

• Read and write ROs to the RODL;

• Perform checklist evaluation of an RO;

• Obtain a raw dump of Research Object metadata.

To illustrate how the user can interact with the RO manager to manipulate Research Objects, Figure 7 showsinteractions for three RO Manager operations, ro create, ro add and ro annotate, which exemplify typicallocal RO management operations.The four interacting elements presented are the user-issued command (/user), the RO Manager program(/RO_Manager), an internal RO metadata object (/ro_metadata) that manages the RO aggregation andannotation metadata, and the local file system (/file_system) where ROs are persistently stored and man-aged.From this, it can be seen that:

• The ro create command initializes an RO structure by interacting directly with the file system.

• The ro add command uses the RO URI to initialize an ro_metadata object, and calls itsaddAggregatedResources() method to incorporate one or more files into the RO aggregation. Thero_metadata object updates the RO metadata structures in the file system through a series or readand write operations.

• The ro annotate command similarly uses the RO URI to initialize an ro_metadata object, and readsthe existing annotations from disk. New annotations may be supplied as an attribute/value or at-tribute/link pair in which a case a new annotation graph is created in the file system. Otherwise thenew annotation may already exist as a graph. In either case, the local copy of the RO manifest isupdated to record the new annotation. The annotation may be applied to multiple resources in the RO.Eventually, the updated manifest is written to the file system by the ro_metadata object.

The RO manager is documented in a user guide that is available online17. An FAQ describing how to dealwith various common operations using RO Manager is also accessible online 18.

17http://wf4ever.github.io/ro-manager/doc/RO-manager.html18http://www.wf4ever-project.org/wiki/display/docs/RO+Manager+FAQ

2013 © Copyright lies with the respective authors and their institutions.

Page 14: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 14 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 7: RO Manager sequence diagram illustrating interactions with the user.

Page 15: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 15 of 29

Figure 8: RO-enabled myExperiment (except RODL, the modules in the above figure belong to the myExper-iment infrastructure).

The RO Manager is implemented in Python, and is available as an installable package through the PythonPackage Index (PyPI) 19. The source code is maintained in the Wf4ever Github repository20. The RO Man-ager is heavily dependent on RDFLib21, which provides RDF parsing, formatting and SPARQL Query capa-bilities. The RO Web service uses the Pyramid22 web framework, and uritemplate23 for RFC 657024 templateexpansion.

3.2 Research Object-Enabled myExperiment

In this section, we describe how myExperiment [RGS09] was extended in order to cater for the sharing,publication and curation of Research Objects. myExperiment is a virtual research environment targetedtowards collaborations for sharing and publishing workflows (and experiments). It provides the functionalitiesnecessary for sharing workflows within and across multiple communities. In doing so, myExperiment adoptsa social web approach, which is adapted to the need of scientists. The workflows that are shared usingmyExperiment do not need to be specified in a particular workflow management system. For example, wefind on myExperiment workflows that have been specified using Galaxy [G+05], Taverna [WHF+13], Kepler[LAB+06] and Vistrails [CFS+06a].While initially targeted towards workflows, the creators of myExperiment were aware that scientists want toshare more than just workflows and experiments. Because of this, myExperiemnt was extended to supportthe sharing of artifacts known as Packs. A pack can be seen as a basic aggregation of resources, which canbe workflows, but also files, presentations, papers, or links to external resources. The notion of packs hasbeen widely adopted by scientists. At the time of writing, myExperiment had 337 packs. Just like a workflow,using myExperiment a pack can be annotated and shared.In order to support complex forms of sharing, reuse and preservation, we have worked during the last year onincorporating the notion of Research Objects (which can be seen as advanced packs) into the development

19https://pypi.python.org/pypi/ro-manager20https://github.com/wf4ever/ro-manager21https://github.com/RDFLib22http://docs.pylonsproject.org/projects/pyramid/23http://code.google.com/p/uri-templates/24http://tools.ietf.org/html/rfc6570

2013 © Copyright lies with the respective authors and their institutions.

Page 16: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 16 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 9: A Sequence diagram illustrating how myExperiment can be used to create Research Objects.

version of myExperiment 25. In addition to the basic aggregation supported by packs, alpha myExperiment26

provides the mechanisms for specifying metadata that describes the relationships between the resourceswithin the aggregation. Moreover, the structure and the types of the resources that compose a pack are nowinline with those that have been identified thanks to the Research Object model. For example, a user is ableto specify that a given file within a pack specifies the hypothesis, that another file specifies the workflow runobtained by enacting a given workflow, or that a given file states the conclusions drawn by the scientists afteranalyzing the workflow run.Figure 8 illustrates a high-level architecture of alpha myExperiment, the development version of myExper-iment into which the Research Objects capabilities were incorporated. As illustrated in the figure, at thelevel of the Rails27 model, data structures that represent the Research Object and associated resourceshave been incorporated. To manipulate such data structures, the controller layer has been extended, and toprovide non-information technology users with the ability to create and manage Research Objects, the viewlayer has been extended with the necessary HTML Web pages.To illustrate how myExperiment can be used for managing Research Objects, Figure 9 depicts a sequenceUML diagram illustrating a typical sequence of interactions that the user undergoes to create and sharea Research Object. Alice (the user) first browses myExperiment to identify a workflow that is of interestto her investigation. Once she identified a relevant workflow, she downloads the workflow, modifies andre-purposes it for her investigation. Once she is happy with the new workflow, Alice decides to create aResearch Object. In doing so, she specifies the hypothesis within a file, which is stored within RODL. RODLacts as a back-end for myExperiment to store the information about Research Objects. Alice then uploadsher workflow to myExperiment. As a result, myExperiment sends a request to the RO transformationservice, which uploads the workflow definition to RODL, transforms the workflow definition into wfdesc,and extracts the annotations that are bundled within the workflow definition. These elements, i.e., wfdescspecification and annotations, are then uploaded to the Research Object in RODL. Alice also uploads the

25http://alpha.myexperiment.org/packs/26It is worth noting that once the development in the myExperiment alpha is judged mature, the new functionalities will be staged

to the production version of myExperiment.27http://rubyonrails.org

Page 17: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 17 of 29

workflow runs obtained as a result of enacting her workflow, and specifies the conclusion she comes to atthe end of her investigation.Using myExperiment, Alice now has a Research Object, compliant with the models from Section 2, viewableand manipulable as a pack through myExperiment, and enriched with a hypothesis and conclusions that canassist other users in understanding and possibly reusing and re-purposing her research results.

4 Research Object Digital Library

The foundational service to preserve workflow-centric Research Objects is the Research Object Digital Li-brary (RODL), which realizes the Storage and Lifecycle functionalities prescribed by Wf4Ever Architecture[DRHPP13]. RODL is a software system which collects, manages and preserves aggregations of scientificworkflows and related objects and annotations, packed into Research Objects. RODL is a back-end servicethat does not directly provide a user interface, but rather system level interfaces through which client softwarecan interact with RODL and provide different user interfaces as tailored to needs. This section presents theinterfaces supported by RODL, and describes their implementations.

Figure 10: Research Objects Digital Library internal component diagram

4.1 The interfaces

The main system level interface of RODL is a set of REST APIs, including the RO API28 and the RO EvolutionAPI29.The RO API, also called the RO Storage and Retrieval API, defines the formats and links used to create andmaintain Research Objects in the digital library. It is aligned with the Research Object model that is used todefine Research Objects, and so it recognizes concepts such as aggregations, annotations and folders. TheResearch Object model ontology (see Section 2) is used to specify relations between different resources.

28http://wf4ever-project.org/wiki/display/docs/RO+API+629http://wf4ever-project.org/wiki/display/docs/RO+evolution+API

2013 © Copyright lies with the respective authors and their institutions.

Page 18: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 18 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Given that the semantic metadata is an important component of a Research Object, the RODL supportscontent negotiation for the metadata resources, including formats such as RDF/XML, Turtle and TriG.The RO Evolution API defines the formats and links used to change the lifecycle stage of a Research Object,most importantly to create an immutable snapshot or archive from a mutable live Research Object, as wellas to retrieve the evolution provenance of a Research Object. The API follows the Research Object evolutionmodel (see Section 2), which is most visible in the evolution metadata that are generated for each statetransition.Additionally, RODL provides a SPARQL endpoint that allows performing SPARQL queries over HTTP to themetadata of all stored Research Objects. It also implements the Notification API30, which defines links usedto retrieve Atom feeds with notifications of events about any Research Object. For searching the contentsof Research Objects a Solr REST API and the OpenSearch APIs are provided. Finally, RODL implements acustom User Management API31 for registering users and generating OAuth 2 access tokens, providing theoption of extending it with an access control layer in the future.

Figure 11: The sequence diagram for creating a Research Object in RODL

Figure 12: The sequence diagram for aggregating a new resource in a Research Object

4.2 The implementation

One of the main design challenges related to the implementation of RODL was the need to support both live,dynamically changing Research Objects as well as immutable snapshots that are intended for a long-termpreservation. With this in mind, the RODL has a modular structure that comprises the access components,the long-term components and the controller that manages the flow of data (see figure 10). For immutableResearch Objects, they are stored in the long-term preservation repository once they are created. Thelive Research Objects, on the other hand, are pushed asynchronously after every change or periodically,depending on the configuration.

30http://wf4ever-project.org/wiki/display/docs/Notification+API31urlhttp://wf4ever-project.org/wiki/display/docs/User+Management+2

Page 19: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 19 of 29

The access components are the storage backend - dLibra32 - and the semantic metadata triplestore. dLibraprovides file storage and retrieval functionalities, including file versioning and consistency checking. It has abuilt-in text search engine and it manages users and controls their access rights. It allows organizing storedobjects into hierarchical structures and associating metadata at the level of object aggregations. It is alsopossible to use a built-in module for storing Research Objects directly in the filesystem.The semantic metadata is additionally parsed and stored in the triplestore backed by Jena TDB33. Jena TDBis an actively developed RDF store implementation, which provides good support for transactions, querying,caching and using named graphs. The use of a triplestore helps in RODL internal data processing and offersa standard query mechanism for RODL clients. It also provides a flexible mechanism for storing metadataabout any component of a Research Object that is identiable via a URI, which apart from workflows and otherresources, may include parts of workflows or external resources (e.g. web services, data sources).

Figure 13: The sequence diagram for creating a snapshot or release from the RO Evolution API clientperspective

The UML sequence diagrams (Figures 11,12) illustrate the interactions between the controller, the storagebackend and the triplestore for the basic operations of creating a Research Object and aggregating resourcesto it. Creating immutable snapshots of Research Objects is a more complex process which involves copyingthe resources, recording their provenance, optional modifications by the user and finally releasing as a pub-lished, immutable object. Figure 13 shows how RODL clients can perform these steps via the RO Evolution

32http://dlab.psnc.pl/dlibra/33http://jena.apache.org/

2013 © Copyright lies with the respective authors and their institutions.

Page 20: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 20 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

API. Figures 14,15 present the interaction between internal RODL components when performing the processof creating the snapshot.

Figure 14: The sequence diagram for preparing the snapshot of a Research Object

Figure 15: The sequence diagram for finalizing the snapshot and making it immutable

Figure 16: The sequence diagram for storing Research Objects in dArceo

The long-term preservation component is built on dArceo34 - a system for long-term preservation of digitalobjects developed by PSNC. dArceo stores the objects and monitors their quality, alerting the administratorsif necessary (see Figure 16). The standard monitoring activities include file format decay alerts and fixitychecking but can be enhanced using a plugin mechanism. In case of RODL, dArceo periodically monitors

34http://dlab.psnc.pl/darceo/

Page 21: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 21 of 29

the quality of Research Objects by calling the Checklist Evaluation and Stability Services35. If a changein quality is detected, notifications are generated as Atom feeds in compliance with the Notification APImentioned above. This helps detect and prevent workflow decay which occurs when an external resource orservice used by the workflow becomes unavailable or is otherwise behaving differently.dArceo gives the possibility to define migration plans that allow to perform a batch update of resources fromone format to another, when necessary. In case of workflows, this may be applied for instance when a flatTaverna t2flow format should be converted to a complex scufl2 format (which, N.B. uses the Research Objectmodel similarly to Research Objects). Another case could be a batch update of workflows that depend on amalfunctioning external resource.Objects in dArceo can be stored on a range of backends, including specialized preservation repositories suchas the Platon service36, storing data in geographically distributed copies and guaranteeing their consistency.A running instance of the RODL is available for testing37. At the moment of writing, it holds more than 1300Research Objects.

4.3 RODL clients

The reference client of RODL is the RO Portal, developed alongside RODL to test new features and exposeall available functionalities. It is running as a web application38. Its main features are Research Objectexploration and visualization; it also allows creating user accounts in RODL and generating access tokensfor other clients. The RO Portal uses all APIs of RODL. The development version of myExperiment (seeSection 3.2) uses RODL as a backend for storing packs. It uses the RO API. Finally, the RO Manager (seeSection 3.1) is a command line tool that is primarily used to manage a Research Object stored on a localdisk. It allows to push a Research Object to RODL via the RO API, as well as converting it into a snapshot inRODL.

5 Abstracting and Indexing Workflows

Using the research object model, one may use any vocabulary that fits their purpose. In this section, we pro-vide two models that we developed for annotating workflows. The first one is aimed at abstracting workflows(Section 5.1), and the second at indexing workflows (Section 6).

5.1 Workflow Abstraction using Motifs

Workflows serve a dual function: first, as detailed documentation of the scientific method used for an ex-periment (i. e. the input sources and processing steps taken for the derivation of a certain data item), andsecond, as re-usable, executable artifacts for data-intensive analysis. Scientific workflows are composed of avariety of data manipulation activities such as Data Movement, Data Transformation, Data Analysis and DataVisualization to serve the goals of the scientific study. The composition is done through the constructs madeavailable by the workflow system used, and is largely shaped by the function undertaken by the workflow andthe environment in which the system operates.A major difficulty in understanding workflows is their complex nature. A workflow may contain several scien-tifically significant analysis steps, combined with other Data Preparation or result delivery activities, and indifferent implementation styles depending on the environment and context in which the workflow is executed.This difficulty in understanding stands in the way of reusing workflows.

35http://wf4ever-project.org/wiki/display/docs/RO+checklist+evaluation+API,http://wf4ever-project.org/wiki/display/docs/Stability+Evaluation+API

36http://www.platon.pionier.net.pl/37http://sandbox.wf4ever-project.org/rodl/38http://sandbox.wf4ever-project.org/portal

2013 © Copyright lies with the respective authors and their institutions.

Page 22: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 22 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

As a first step towards addressing this issue [GAB+12] describes a catalogue of domain independent con-ceptual abstractions for workflow steps called scientific Workflow Motifs. The catalogue was built based onan empirical analysis performed over 260 workflow descriptions from Taverna [WHF+13], Wings [GRK+11],Galaxy [GNT10] and Vistrails [CFS+06b]. Motifs are provided through i) a characterization of the kinds ofdata-oriented activities that are carried out within workflows, which are referred to as Data-Operation motifs,and ii) a characterization of the different manners in which those activity motifs are realized/implementedwithin workflows, referred to as Workflow-Oriented motifs. Figure 17 shows an example of a Taverna work-flow with its motifs highlighted. The workflow transfers data files containing proteomics data to a remoteserver and augments several parameters for the invocation request. Then the workflow waits for job comple-tion and inquires about the state of the submitted warping job. Once the inquiry call is returned the resultsare downloaded from the remote server.

Figure 17: Sample motifs in a Taverna workflow for functional genomics.

This section describes the Workflow Motifs ontology39, an OWL 2 encoding ot the aforementioned motifcatalogue. The goal of this ontology is to provide the means to annotate workflows and their steps with themotifs of the vocabulary, without setting any restriction on how the workflows are defined themselves.

5.1.1 Representing Motifs

Figure 18 shows an overview of the class taxonomy of the ontology. The class Motif represents the dif-ferent classes of motifs identified in the catalog. This class is categorized into two specialized sub-classesDataOperationMotif and WorkflowMotif, which are sub-classed following the taxonomy represented in[GAB+12].

39http://purl.org/net/wf-motifs

Page 23: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 23 of 29

Figure 18: Diagram showing an overview of the class taxonomy of the motif OWL ontology.

The ontology provides three properties to link motifs to workflow specifications and their fragments.The hasMotif property associates workflows and their operations with their motifs. The propertieshasDataOperationMotif and hasWorkflowMotif allow annotating workflows and their steps with morespecificity. These properties have no domain specified, as different workflow models may use different vo-cabularies for describing workflows and their parts.

5.1.2 Representing Workflows and Workflow Steps

Workflows may be represented with different models and vocabularies like OPMW [GG11], P-Plan [GG12] orD-PROV [MDB+13]. While providing an abstract and consistent representation of the workflow is not a pre-requisite to the usage of the Motif ontology, we consider it a best-practice to use a model that is independentfrom any specific workflow language or technology. An example of annotation using the wfdesc model isgiven in Figure 19 by showing the annotations of part of the Taverna workflow shown in Figure 17.The annotations encoded using the Motif Ontology could be used in a variety of applications. By providing ex-plicit semantics on the data processing characteristic and the implementation characteristic of the operations,annotations improve understandability and interpretation. Moreover, they can be used to facilitate workflowdiscovery. For example, the user can issue a query to identify workflows that implement a specific flow of

2013 © Copyright lies with the respective authors and their institutions.

Page 24: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 24 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 19: Subset of the annotations of the Taverna workflow shown in Figure 17 using the wfdesc model.

data manipulation and transformation (e.g., return the workflows in which data reformatting is followed bydata filtering and then data visualization). Having information on characteristics of workflow operations allowfor manipulation of workflows to generate summaries [ABGK13] of workflow descriptions or their executiontraces.

6 Indexing Workflows

This section shows how workflows have been indexed using a generalized trie structure40 and a seri-alization process. A workflow wf can be defined as a set of Processes

wf

and Params

wf

| wf =Processes

wf

[ Params

wf

where the Processes

wf

are the definition of specific tasks to be executedand uses wfdesc:Process for its description, and Params

wf

defines the inputs and outputs of those tasksdescribed by wfprov:Input and wfprov:Output which are subclasses of wfdesc:Parameter. On the other hand,a trie structure is an ordered tree data structure that allows to store dynamically a vector or an associativearray where the keys are the values being stored itself. The main characteristic of this structure is absenceof tree nodes being used for storing the key associated with that node but the position of the node withinthe tree defines the key. Another characteristic of this structure is that all the descendants of a node have acommon prefix and therefore allows indexing simultaneously complete or partial paths to a specific node.

For our purposes of indexing a workflow, or generally speaking any DAG, by using a trie structure as asequence of items which represents the DAG partially or completely for later accessing, a preprocessing ofthe set of workflows is needed as a first step. For accomplishing with this preprocessing goal of adapting aDAG to the trie indexing structure we have used one of the possible topological orders of a DAG. It is knownthat any DAG has at least one topological ordering which assures that if a vertex u is linked through anedge to vertex v, then after sorting it the vertex u will come before v in ordering. This topological orderingdoes not have to be unique and therefore a DAG could be defined by multiple topological sorts. Similarlyto [MAYU03], in order to avoid the possibility of having similar workflows defined in different ways within thesame indexing structure, we have chosen the lexicographical order of processes as common criteria to beapplied. Therefore, the chosen overall used criteria for ordering the DAG sequentially has been both thetopological and lexicographical orders.

The figure 20 shows how the "Extract proteins using a gi - output as fasta file"41 example which was obtainedfrom ProvBench data set [BZG+13], is sorted by applying our criteria. One of the main advantages of thisapproach is that after transforming the DAG workflow into a sequential linked set of resources it can beindexed almost instantaneously and the time for searching and exact matching is linear with the size of the

40Trie comes from the word reTrieval indicating the process of information accessing but it is also called suffix tree.41http://www.myexperiment.org/workflows/1182.html

Page 25: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 25 of 29

tree which is also directly related to the vocabulary size of the domain.

Figure 20: Sorting the "Extract proteins using a gi - output as fasta file" example obtained from ProvBench

6.1 Applications and Implementation

In order to provide scientists with searching and design capabilities we have implemented two different webservices which makes use of the above introduced indexing structure for accessing to workflows and rec-ommend next steps given the previous ones. It is worth highlighting that the implemented indexing structurecould be also used for other purposes as the discovery of frequent patterns by using the collected statisticalinformation and mining the semantic trie structure which would take full advantage of the intrinsic sequentialordering of the trie structure.The implementation of both services (searching and next step recommendation) has been done applying theguidelines of the Wf4Ever project and using the wfprov and wfdesc ontologies. Both services are REST andcan be called by an HTTP GET method which upon on ACCEPT headers may return XML or JSON formats.Also for implementation purposes we have encapsulated the trie indexing structured in order to provided theabove presented services. The interaction between the mentioned components can be seen at Figure 21.

6.1.1 Searching

This service42 searches for those workflows which contain a specific sequence of processes providing onreal time a list including all of them. The service accepts an array of processes’ names as input parameterprocess[] (e.g. process?=p1&p2&p3). Therefore the general call would be of the form43:

The output is an XML or JSON structure with the following attributes:

• Process_id: is the name of the process or processes used in the query.

• freq: is the number of times that the sequence of processes appears in other workflows.

• URIs: are the ROs’ URIs which contain workflows where the requested sequence appears.42http://sandbox.wf4ever-project.org/wfabstraction/rest/search43http://sandbox.wf4ever-project.org/wfabstraction/rest/search{?process[]}

2013 © Copyright lies with the respective authors and their institutions.

Page 26: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 26 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

Figure 21: Sequence diagram for workflows indexing, searching, and recommending next step processes.

6.1.2 Next step recommendation

The proposed trie structure captures the provenance of workflows execution associated to scientific experi-ments allowing their indexation based on the temporal information that they contain. The trie structure is alsoupdated to gather the needed statistics for mining the execution of workflows and provide the recommendednext process based on how frequent a pattern of use occurs. So far we collected the number of times aprocess occurs in the dataset and also the probability associated to that pattern given the set of patternsof the same size. The fact that the exact matching searching is linear with the size of the tree (whichis dependent of the use vocabulary, in our case the domain has been restricted to the set of processesincluded in ProvBench [BZG+13]) makes it very suitable for this type of applications. The implementedservice44 accepts an array of processes’ names as input parameter process[] (e.g. process?=p1&p2&p3)and its general call would be of the form45:

The output is an XML or JSON structure with different attributes:

• Id: is the name of the recommended process for that input query.

• freq: is the number of times that it appears in different workflows.

• prob: is the probability of the given recommendation taking into account the whole set of possible nextsteps.

44http://sandbox.wf4ever-project.org/wfabstraction/rest/recommend45http://sandbox.wf4ever-project.org/wfabstraction/rest/recommend{?process[]}

Page 27: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 27 of 29

7 Summary

We have presented in this deliverable the final version Research Object model defined within Wf4Ever, asa family of ontologies. We also presented the tools that were built on the model in order to facilitate thecreation, curation and sharing of Research Objects, namely, the Research Object Manager (RO Manager),a command line tool for creating, displaying and manipulating Research Objects, RODL, which acts as aback-end, with two storage alternatives: a digital repository to keep the content, as a triple store to managethe metadata content, and the myExperiment virtual research environment, which was extended to allowend-users to create, upload, share and curate Research Objects. We also presented two models that can beused for annotating workflows with the objective to abstract and index them.Our ongoing and future work aims to advertise and disseminate the Research Object model and the toolsdeveloped around it. In this respect, it is worth mentioning that we have launched a website dedicated toResearch Objects46, with examples that assist prospective adopters in understanding the model usage andbenefits.

46http://www.researchobject.org/

2013 © Copyright lies with the respective authors and their institutions.

Page 28: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

Page 28 of 29 Wf4Ever STREP FP7-ICT-2007-6 270192

References

[ABGK13] P. Alper, K. Belhajjame, C. Goble, and P. Karagoz. Small is beautiful: Summarizing scientificworkflows using semantic annotations. In Proceedings of the IEEE 2nd International Congresson Big Data (BigData 2013), Santa Clara, CA, USA, June 2013.

[BZG+13] K. Belhajjame, J. Zhao, D. Garijo, A. Garrido, S. Soiland-Reyes, Alper P., and Corcho. O. Work-flow prov-corpues based on taverna and wings. In EDBT/ICDT’13. ACM Press, 2013.

[CFS+06a] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and . T. Vo. Vistrails: Visual-ization meets data management. In In ACM SIGMOD, pages 745–747. ACM Press, 2006.

[CFS+06b] S. P. Callahan, J. Freire, E. Santos, C. E. Scheidegger, C. T. Silva, and H. T. Vo. Vistrails:Visualization meets data management. In In ACM SIGMOD, pages 745–747. ACM Press, 2006.

[COC+11] P. Ciccarese, M. Ocana, L. J Garcia Castro, S. Das, and T. Clark. An open annotation ontologyfor science on web 3.0. J Biomed Semantics, 2(Suppl 2):S4, May 2011.

[DRHPP13] David De Roure, Piotr Hołubowicz, Kevin Page, and Raúl Palma. D1.3v2: Wf4ever architecture- phase ii. Technical report, University of Oxford, March 2013.

[G+05] B. G. et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Research,15(10):1451–1455, Oct 2005.

[GAB+12] D. Garijo, P. Alper, K. Belhajjame, O. Corcho, Y. Gil, and C. Goble. Common motifs in scientificworkflows: An empirical analysis. In 8th IEEE International Conference on eScience 2012,8th IEEE International Conference on eScience 2012, Chicago, 2012. IEEE Computer SocietyPress, USA.

[GC+13a] E. Garcia-Cuesta et al. Design, implementation and deployment of Workflow Integrity and Au-thenticity Maintenance components - Phase II. Deliverable D4.2v2, Wf4Ever Project, 2013.

[GC+13b] R. Gonzalez-Cabero et al. Design, implementation and deployment of Workflow Evolution,Sharing and Collaboration components - Phase II. Deliverable D3.2v2, Wf4Ever Project, 2013.

[GG11] D. Garijo and Y. Gil. A new approach for publishing workflows: Abstractions, standards, andlinked data. In Proceedings of the 6th workshop on Workflows in support of large-scale science,Proceedings of the 6th workshop on Workflows in support of large-scale science, pages 47–56,Seattle, 2011. ACM.

[GG12] D. Garijo and Y. Gil. Augmenting PROV with Plans in P-PLAN: Scientific Processes as LinkedData. In Second International Workshop on Linked Science: Tackling Big Data (LISC), held inconjunction with the International Semantic Web Conference (ISWC), Boston, MA, 2012.

[GNT10] J. Goecks, A. Nekrutenko, and J. Taylor. Galaxy: a comprehensive approach for supportingaccessible, reproducible, and transparent computational research in the life sciences. GenomeBiology, 11(8):R86, 2010.

[GRK+11] Y. Gil, V. Ratnakar, J. Kim, et al. Wings: Intelligent workflow-based design of computationalexperiments. IEEE Intelligent Systems, 26(1):62–72, 2011.

[LAB+06] Bertram L., Ilkay A., Chad B., Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, JingTao, and Yang Zhao. Scientific workflow management and the kepler system. Concurrency andComputation: Practice and Experience, 18(10):1039–1065, 2006.

[LSM+13] T. Lebo, S. Sahoo, D. McGuinness, K. Belhajjame, J. Cheney, D. Corsar, D. Garijo, S. Soiland-Reyes, S. Zednik, and J. Zhao. Prov-o: The prov ontology. Technical report, W3C, 2013.

Page 29: D2.2v2 Design, implementation and deployment of workflow ......D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 7 of 29 1 Introduction

D2.2v2 Design, implementation and deployment of workflow lifecycle management components - Phase IIPage 29 of 29

[MAYU03] Akiyoshi Matono, Toshiyuki Amagasa, Masatoshi Yoshikawa, and Shunsuke Uemura. An Index-ing Scheme for RDF and RDF Schema based on Suffix Arrays. In Proceedings of SWDB 2003,pages 151–168, 2003.

[MDB+13] P. Missier, S. Dey, K. Belhajjame, V. Cuevas, and B. Ludaescher. D-PROV: extending the PROVprovenance model with workflow structure. In Procs. TAPP’13, Lombard, IL, 2013.

[RGS09] D. De Roure, C. A. Goble, and R. Stevens. The design and realisation of the myExperimentvirtual research environment for social sharing of workflows. Future Generation Comp. Syst.,25(5):561–567, 2009.

[Shi07] Matthew Shields. Control- versus data-driven workflows. In IanJ. Taylor, Ewa Deelman, Den-nisB. Gannon, and Matthew Shields, editors, Workflows for e-Science, pages 167–173. SpringerLondon, 2007.

[WHF+13] K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes,I. Dunlop, A. Nenadic, P. Fisher, J. Bhagat, K. Belhajjame, F. Bacall, A. Hardisty, A. Nieva de laHidalga, M. P. Balcazar Vargas, S. Sufi, and C. Goble. The taverna workflow suite: designingand executing workflows of web services on the desktop, web or in the cloud. Nucleic AcidsResearch, 2013.

2013 © Copyright lies with the respective authors and their institutions.