
Attendees Research Statements

NSF Workshop on Challenges of Scientific Workflows
National Science Foundation, Arlington, VA

May 1-2, 2006

Prior to the workshop, the attendees were asked to submit a one-page research statement describing current work (including citations and web pointers) and outlining future topics of research in scientific workflows. This document is a compilation of those statements, grouped according to the four planned discussion topics for the workshop.

1) Applications and requirements: What are the requirements of future applications? What new capabilities are needed to support emerging applications?

• Geoffrey Fox [11], Indiana University (lead)
• Constantinos Evangelinos, Massachusetts Institute of Technology
• Ian Foster [10], University of Chicago and Argonne National Laboratory
• Jeffrey Grethe [15], University of California San Diego
• Ed Seidel, Louisiana State University
• Ashish Sharma, Ohio State University
• Alex Szalay [28], Johns Hopkins University

2) Dynamic workflows and user steering: What are the challenges in supporting dynamic workflows that need to evolve over time as execution data become available? What kinds of techniques can support incremental and dynamic workflow evolution due to user steering?

• Carole Goble [13], University of Manchester (lead)
• Mark Ackerman [4], University of Michigan
• Mark Ellisman [7], University of California San Diego
• Juliana Freire [9], University of Utah
• Dennis Gannon [12], Indiana University
• Karen Myers [23], SRI International
• Walt Scacchi [25], University of California, Irvine


3) Data and workflow descriptions: How can workflow descriptions be improved to support usability and scalability? How should data produced as part of the workflows be described? What provenance information needs to be tracked to support scalable data and workflow discovery?

• Jim Myers [22], NCSA (lead)
• Ilkay Altintas [5], SDSC
• Roger Barga [6], Microsoft
• Yolanda Gil [3], USC Information Sciences Institute
• Alexander Gray [14], Georgia Tech
• Jim Hendler [16], University of Maryland
• Craig Knoblock [18], USC Information Sciences Institute
• Luc Moreau [21], University of Southampton
• Amit Sheth [27], University of Georgia

4) System-level management: What are the challenges in supporting large-scale workflows in a scalable and robust way? What changes are needed in existing software infrastructure? What new research needs to be done to develop better workflow management systems?

• Miron Livny [20], University of Wisconsin (lead)
• Ewa Deelman [2], USC Information Sciences Institute
• Francisco Curbera, IBM
• Thomas Fahringer [8], University of Innsbruck
• Carl Kesselman [17], USC Information Sciences Institute
• Chuck Koelbel [19], Rice University
• Gregor von Laszewski [29], ANL


Geoffrey Fox

Workflow

Geoffrey Fox, Indiana University
May 1-2, 2006

General links are

1. GGF10 workflow meeting summary http://grids.ucs.indiana.edu/ptliupages/presentations/ggfsummary-mar9-04.ppt

2. GGF10 workflow special issue editorial http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf

3. GGF10 workflow special issue http://www.cc-pe.net/iuhome/workflow2004index.html or published at http://www3.interscience.wiley.com/cgi-bin/jissue/105558633 (except for editorial above)

4. Jia Yu and Rajkumar Buyya, A Taxonomy of Workflow Management Systems for Grid Computing, Technical Report, GRIDS-TR-2005-1, Grid Computing and Distributed Systems Laboratory, University of Melbourne, Australia, March 10, 2005. http://www.gridbus.org/reports/GridWorkflowTaxonomy.pdf

5. Our work on scripting language based workflow http://www.hpsearch.org

6. Managing Dynamic Services and flows http://grids.ucs.indiana.edu/ptliupages/publications/clade06-ManagingGridMessagingMiddleware-UPDATED.pdf

7. Streaming seismic sensor applications http://grids.ucs.indiana.edu/ptliupages/publications/skg2005-FGCS-Extention-re-re-re-revised.pdf

8. DoD Net-Centric Environments and Grids http://grids.ucs.indiana.edu/ptliupages/publications/gig

9. Collaboration workflows http://grids.ucs.indiana.edu/ptliupages/publications/soa-voip-05.doc

10. Fault Tolerant Streams http://www.naradabrokering.org

11. Fault Tolerant system metadata http://www.opengrids.org/wscontext/

1) Applications and requirements
There is, I think, a need for a benchmark set for workflow that plays the same role the NAS benchmarks did for parallel computing. It should be specified in "pencil and paper" form so it can be implemented in the many different languages and GUIs. Further, one should make the "network and computational size" of the benchmarks variable. The set should cover a range of application classes as given in links 1-4 above.

2) Dynamic workflows and user steering
Much of our work is for dynamic flow-based grids such as those envisaged for the Global Information Grid (link 8) and for sensor nets (link 7). Note that here there are complex workflows, but parts of these are generated automatically, for example when a new sensor joins the system or when the user decides to insert a new filter service. Here one doesn't need a language to specify the workflow (except to record it for posterity). Rather, one needs to be able to "rewire" the data to add a new stream or to flow a stream through a new filter. Collaboration workflows (link 9) are of this class. There should be more research on "scheduling" and "routing" of streams of messages flowing between the service nodes of a workflow.

3) Data and workflow descriptions
I often wonder out loud why the field of workflow does not relate better to well-established areas such as distributed programming and the many existing dataflow systems that follow the old model of AVS (visualization) and Khoros (image processing). We use scripting (JavaScript in link 5) to specify workflows and management strategies.

4) System-level management
We have focused (link 6) on the general problem of service management, which is (implicitly) part of any workflow runtime, with an emphasis on scalable fault tolerance. We have some key constructs: fault-tolerant streams (link 10), a fault-tolerant metadata store (link 11), largely stateless (except for performance-enhancing caches) dynamic managers, and the "managees" (the services in the workflow).


Ian Foster

See Szalay et al. statement:

“The Importance of Data Locality in Distributed Computing Applications”
Alex Szalay, Julian Bunn, Jim Gray, Ian Foster, Ioan Raicu

See also Deelman et al. statement:

“Community Process as Workflow”
Ewa Deelman, Ian Foster, Carl Kesselman, Mike Wilde


Jeffrey S. Grethe
BIRN Coordinating Center
Univ of Calif, San Diego

Research Statement for the Workshop on Challenges of Scientific Workflows

The Biomedical Informatics Research Network (BIRN) is an infrastructure project of the National Institutes of Health. A main objective of BIRN is to foster large-scale collaborations in biomedical science by utilizing the capabilities of the emerging cyberinfrastructure (Grethe et al., 2005). Currently, the BIRN involves a consortium of more than 30 universities and 40 research groups participating in test bed projects centered around brain imaging of human neurological disease and associated animal models. The promise of the BIRN is the ability to test new hypotheses through the analysis of larger patient populations and unique multi-resolution views of animal models through data sharing and the integration of site-independent resources for collaborative data refinement.

The challenge in such research is that workflows, as defined by domain scientists, typically represent the end-to-end application process, which often includes a heterogeneous mix of experimental processes and the corresponding collection of distinct workflows (information gathering, bench/laboratory experimentation, computation, analysis, visualization, etc.). An additional complexity is that within collaborative environments such as BIRN, researchers are utilizing multiple workflow environments (e.g. LONI pipeline, Kepler, fBIRN Image Processing Stream) and it may be impossible to standardize on a single environment due to the requirements of a specific research community or study. Therefore, it has become increasingly important to provide an environment in which researchers can manage the complete end-to-end scientific process by utilizing and combining pre-constructed application workflows (regardless of the workflow platform) via a unified portal interface. As communities develop conventions and best practices for the processing of certain data (e.g. pre-processing and first-level analyses of functional MRI data), it will be important that a researcher is able to integrate these components into her scientific process, thereby increasing interoperability of application workflows across communities and projects. The BIRN Workflow Working Group is creating such a workflow management engine, which is being built as a JSR-168 compliant GridSphere-based portlet so that it may be used by the broader scientific community.

Sharing application workflows broadly poses additional concerns and challenges. First, it will be necessary to facilitate the ability for researchers to discover and utilize these workflows and for developers to incorporate these workflows into their application environments. This will require a high-level workflow description language that provides an appropriate description of the workflow and exposes the important parameters to the user with appropriate usage information. Second, as workflows are developed within communities for specific tasks, it will be important to provide a framework for validation and detailed analysis of their functionality. As part of this analysis, it will be necessary to evaluate the components of these workflows and build a profile of the contributions of the individual component applications and their interactions in specific settings. Utilizing this information, a researcher can be provided with a comprehensive guide and set of best practices that will allow them to construct optimized and improved workflows.

Grethe, J.G., Baru, C., Gupta, A., James, M., Ludaescher, B., Martone, M., Papadopoulos, P.M., Peltier, S.T., Rajasekar, A., Santini, S., Zaslavsky, I.N., Ellisman, M.H. (2005) "Biomedical Informatics Research Network: Building a National Collaboratory to Hasten the Derivation of New Understanding and Treatment of Disease", in From Grid to Healthgrid: Proceedings of Healthgrid 2005, Solomonides, McClatchey, Breton, Legré and Nørager (eds.), Amsterdam, IOS Press.


Alex Szalay

The Importance of Data Locality in Distributed Computing Applications
Alex Szalay, Julian Bunn, Jim Gray, Ian Foster, Ioan Raicu

Current grid computing environments are primarily built to support large-scale batch computations, where turnaround may be measured in hours or days – their primary goal is not interactive data analysis. While these batch systems are necessary and highly useful for the repetitive 'pipeline processing' of many large scientific collaborations, they are less useful for subsequent scientific analyses of higher-level data products, usually performed by individual scientists or small groups. Such exploratory, interactive analyses require turnaround measured in minutes or seconds so that the scientist can focus, pose questions and get answers within one session. The databases, analysis tasks and visualization tasks involve hundreds of computers and terabytes of data. Of course this interactive access will not be achieved by magic – it requires new organizations of storage, networking and computing, new algorithms, and new tools.

As CPU cycles become cheaper and data sets double in size every year, the main challenge for rapid turnaround is the location of the data relative to the available computational resources – moving the data repeatedly to distant CPUs is becoming the bottleneck. There are large differences in IO speeds from local disk storage to wide area networks. A single $10K server today can easily provide a GB/sec of IO bandwidth, which requires a 10 Gbit/sec network connection to transmit. We propose a system in which each 'node' (perhaps a small cluster of tightly coupled computers) has its own high-speed local storage that functions as a smart data cache.

Interactive users measure a system by its time-to-solution: the time to go from hypothesis to results. The early steps might move some data from a slow long-term storage resource. But the analysis will quickly form a working set of data and applications that should be co-located in a high-performance cluster of processors, storage, and applications.

A data and application scheduling system can observe the workload and recognize data and application locality. Repeated requests for the same services lead to a dynamic rearrangement of the data: the frequently called applications will have their data 'diffusing' into the grid, most of it residing in local, thus fast, storage, reaching a near-optimal thermal equilibrium with their competitor processes for the resources. The process arbitrating data movement is aware of all relevant costs, which include data movement, computing, and starting and stopping applications.
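
To make the trade-off concrete, here is a minimal back-of-the-envelope sketch in Python of the decision such an arbitrating process faces: run where the data already sits, or pay the wide-area staging cost to reach faster CPUs. The bandwidth and runtime figures are invented for illustration and are not taken from the statement.

```python
# A minimal sketch, assuming hypothetical bandwidths and runtimes, of a
# cost-aware placement decision: compute time plus any WAN staging cost.

def transfer_seconds(bytes_to_move, link_bandwidth_Bps):
    return bytes_to_move / link_bandwidth_Bps

def time_to_solution(compute_seconds, data_bytes, data_is_local,
                     wan_bandwidth_Bps=1.25e8):      # ~1 Gbit/s WAN (assumed)
    staging = 0.0 if data_is_local else transfer_seconds(data_bytes, wan_bandwidth_Bps)
    return compute_seconds + staging

# Example: with a 1 TB working set, the slower node that already holds the
# data wins, because staging over the WAN dominates the fast node's runtime.
remote_fast = time_to_solution(compute_seconds=600,  data_bytes=1e12, data_is_local=False)
local_slow  = time_to_solution(compute_seconds=1800, data_bytes=1e12, data_is_local=True)
print(remote_fast, local_slow)   # ~8600 s vs 1800 s
```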

Such an adaptive system can respond rapidly to small requests, in addition to the background batch processing applications. We believe that many of the necessary components to build a suitable system are already available: they just need to be connected following this new philosophy. The architecture requires aggressive use of data partitioning and replication among computational nodes, extensive use of indexing, pre-computation, computational caching, detailed monitoring of the system and immediate feedback (computational steering) so that execution plans can be modified. It also requires resource scheduling mechanisms that favor interactive uses.

We have identified a few possible scenarios from the anticipated use of the National Virtual Observatory data that are currently used to experiment with this approach. These include image stacking services, and fast parallel federations of large collections. These experiments are already telling us that, in contrast to traditional scheduling, we need to schedule not just individual jobs but workloads of many jobs.

The key to a successful system will be "provisioning", i.e., a process that decides how many resources to allocate to different workloads. It can run the stacking service on 1 CPU, or 100: the number to be allocated at any particular time will depend on the load (or expected load) and on other demands.


Carole Goble

Scientific Workflows are for Scientists
Carole Goble, The myGrid/Taverna Project, The University of Manchester, UK [email protected]

myGrid (http://www.mygrid.org.uk) is a UK e-Science project providing middleware services to assist bioinformaticians to support exploratory, data-intensive, in silico experiments in molecular biology. Workflows are used to orchestrate, access and interoperate a large number of public databases and applications, and to manage those experiments and their outcomes (including the resultant data products, their provenance and experiment conclusions) using semantic-based metadata and data technologies [4]. Note that the workflow processes are third-party applications, rather than computational jobs to be scheduled. Our workflow environment and workbench, Taverna [2], enables scientists to design and execute workflows, providing access to over 3000 bio-resources and using semantic technologies to describe, discover and publish their function and properties [3]. The services are a mixture of web services, grid services, Java applications, database queries and scripts. Taverna has proved popular. The current version, 1.3.1, has had 2200 downloads in the past four months, is a component of a range of UK and European e-Science projects, and has an active Open Source community of contributors. Taverna is used for gene alerting, gene and protein sequence annotation, proteomics, functional genomics, chemoinformatics, systems biology, protein structure prediction applications and medical imaging. Workflows have been used to identify a mutation associated with the autoimmune disorder Graves' Disease in the I kappa B-epsilon gene [6] and to build the first complete and accurate map of the region of chromosome 7 involved in Williams-Beuren Syndrome [5]. Taverna is now part of the UK's Open Middleware Infrastructure Institute, which means it has a dedicated team of software engineers to develop it to production quality over the next three years. Taverna 1.4, our first production release, will be available in September 2006.

The design of Taverna has been driven by: the users we wish to support; the nature of the existing resources they wish to orchestrate; and the type of in silico experiment they wish to perform. Our emphasis is on building workflows that link together third-party applications (both remote and local) that are familiar to the scientist, using a language and tools designed for the scientist. We support two classes of user: (a) bioinformaticians with deep knowledge of the scientific functionality of the resources they want to link together but little knowledge of specific middleware, and (b) service providers of resources with poor or missing programmatic interfaces encompassing heterogeneous and semi-structured data exposed using a diverse range of mechanisms, and no prescribed standards for formats, APIs or delivery platforms. The users and the service providers are loosely coupled and independent of each other.

Our users are often members of small, poorly resourced research groups that begin their analyses as low-volume, ad-hoc experiments that are incrementally and rapidly prototyped in an exploratory way. Some are effectively disposable, whilst others develop into production workflows which will be executed repeatedly. Scientific benefit comes from combining results from many different workflow runs, so it is essential to manage the data produced by each experiment. Scientists frequently undertake similar analyses to those of other groups, and therefore the workflow designed by one user is often suitable for adoption or adaptation by many others. By sharing workflows we propagate scientific know-how and best practice [1].

Our experiences lead us to the following position statements.

Scientific workflows are for scientists. Easily assembling workflows, finding services and adapting previous workflows is key. The user should be thinking about the workflow as an experiment, not its execution complexity. The descriptions of services have to be in the user's language, not WSDL. The workflow has to reflect the experiment, not the service invocation interface. Workflow provenance that is a log of executions is great for debugging but not good for describing a scientific argument. Biologists have strong opinions about the particular services that they wish to use; they do not accept substitutes. Automated workflow design is unlikely, unpopular, and undesirable. Listen to the scientist.

Deal with what is out there. We wanted to be able to use any service as it was presented, rather than require service providers to jump through hoops to get incorporated. The world is full of services that are not WSDL. This promiscuous approach meant we built up a large number of services, and it is service availability that makes a scientist want to build a workflow. Taverna caters for a variety of different service interfaces, and does not require adherence to a common universal type system. Easy for service providers. Harder for middleware.

Get your abstractions right. To present a straightforward perspective to our users and yet cope with the heterogeneous interfaces of third-party services requires a multi-tiered approach to resource discovery, workflow provenance recording and execution that separates application and user concerns from operational and middleware concerns. For us this resulted in a three-tiered model and exposing our users to Scufl, a workflow language for linking applications, not BPEL, which is an assembler language for web services.

Workflows are part of the means to an end, not the end. You have got to incorporate workflows into a whole experimental lifecycle. A workflow should make it easier for scientists to describe and run their experiments in a structured, repeatable and verifiable way. However, workflows exist in a wider context of scientific data management. The user is still at the centre, interacting with the workflows and the services, and interpreting the outcomes. Workflows are a resource in their own right, to be described, shared and reused [1]. It is essential that data produced by a workflow carries with it some record of how and why it was produced, i.e. the provenance of the data.

Managing the results is the tricky problem. Our workflows compute new data and gather together pre-existing data, and work in the knowledge of MIME types but in ignorance of domain data types. The same workflow will be re-run over and over to cope with changes in the underlying data resources, and the results (and provenance logs) compared and aggregated. The workflows may produce intermediate results all the way through and/or an end-result data object. The problem is not workflow execution but the effective management, presentation and analysis of the data and provenance that comes out. This is made harder or easier by the design of the workflows, implying the need for best practice and workflow pattern books.

There is more than one way to skin a cat. There will be no one workflow language or workflow system, just as there is no one programming language or operating system. A spectrum of factors influences the adoption of a particular workflow language and tools: workflow language expressivity; workflow language abstractions; user skills; user training; open access to workflow tools; data and provenance management; and the existence of a user community willing to share workflows and good practice in workflow design. Aim to interoperate.

References

1. Chris Wroe, Carole Goble, Antoon Goderis, Phillip Lord, Simon Miles, Juri Papay, Pinar Alper, Luc Moreau, "Recycling workflows and services through discovery and reuse", Concurrency and Computation: Practice and Experience, accepted for publication March 2005, in press.

2. Tom Oinn, Mark Greenwood, Matthew Addis, M. Nedim Alpdemir, Justin Ferris, Kevin Glover, Carole Goble, Duncan Hull, Darren Marvin, Peter Li, Phillip Lord, Matthew R. Pocock, Martin Senger, Robert Stevens, Anil Wipat and Chris Wroe, "Taverna: Lessons in creating a workflow environment for the life sciences", Concurrency and Computation: Practice and Experience, accepted for publication 2005, in press.

3. Phillip Lord, Pinar Alper, Chris Wroe, and Carole Goble, "Feta: A light-weight architecture for user oriented semantic service discovery", in Proc. of the 2nd European Semantic Web Conference, Crete, 29 May - 1 June 2005, Springer LNCS 3532.

4. Jun Zhao, Chris Wroe, Carole Goble, Robert Stevens, Dennis Quan, Mark Greenwood, "Using Semantic Web Technologies for Representing e-Science Provenance", in Proc. 3rd International Semantic Web Conference (ISWC2004), Hiroshima, Japan, 9-11 Nov 2004, Springer LNCS 3298.

5. Robert Stevens, Hannah J. Tipney, Chris Wroe, Tom Oinn, Martin Senger, Phillip Lord, Carole A. Goble, Andy Brass and May Tassabehji, "Exploring Williams-Beuren Syndrome Using myGrid", Bioinformatics 20:i303-i310, Proc. of the 12th Intelligent Systems in Molecular Biology (ISMB), 31 Jul - 4 Aug 2004, Glasgow, UK.

6. Peter Li, Keith Hayward, Claire Jennings, Kate Owen, Tom Oinn, Robert Stevens, Simon Pearce and Anil Wipat, "Association of variations on I kappa B-epsilon with Graves' disease using classical and myGrid methodologies", Proc. UK e-Science All Hands Meeting, September 2004.


Juliana Freire

Managing Dynamic Workflows
Juliana Freire and Claudio T. Silva

University of Utah – http://www.sci.utah.edu/~vgc/vistrails

VisTrails: Workflows for Data Exploration. Workflow systems have been traditionally used to automate repetitive tasks and to ensure reproducibility of results. However, for applications that are exploratory in nature, and in which large parameter spaces need to be investigated, series of related workflows must be created. At the University of Utah, we have been developing VisTrails, a workflow management system for scientific data exploration and visualization. A novel feature of VisTrails is an action-based mechanism which uniformly captures provenance information for both data products and the workflows used to generate these products. As illustrated in Figure 1, a vistrail is a rooted tree in which each node corresponds to a version of a workflow, and an edge between two nodes dp and dc, where dp is the parent of dc, corresponds to the action applied to dp which generated dc. Instead of storing a set of related workflows, we store the operations or actions that are applied to the workflows—this representation is both simple and compact. By systematically tracking detailed provenance information, VisTrails not only ensures reproducibility, but it also allows scientists to easily navigate through the space of workflows and parameter settings used in a given exploration task. Powerful operations are also possible through direct manipulation of the version tree. These operations, combined with an intuitive interface for comparing the results of different workflows, greatly simplify the scientific discovery process. For example, the stored actions lead to an intuitive macro facility that allows the re-use of workflows—a macro is essentially a sequence of actions that can be applied to nodes in the version tree. Bulk updates can also be applied in a similar fashion—through a sequence of actions—providing a scalable means to explore an n-dimensional slice of the parameter space of a workflow and generate a large number of data products. Another useful feature enabled by the action-based provenance is the ability to analyze (and visualize) the differences between two workflows. Since a workflow is represented by a sequence of actions, the diff between two workflows can be computed as the difference between their corresponding sets of actions. Last, but not least, the version tree structure allows several users to collaboratively, in a distributed and disconnected fashion, modify and synchronize their vistrails.

Interacting with Provenance Information. Maintaining provenance of both workflow evolution and data products has many benefits, but it also presents many challenges. A potential problem is information overflow—too much data can actually confuse users. An important challenge we need to address is how to design intuitive interfaces and provide adequate functionality to help the user interact with and use the provenance information productively. We are currently investigating interfaces and languages that facilitate the querying and exploration of provenance data, as well as efficient storage strategies.

Enabling Domain Scientists to Steer the Data Exploration Process. A big barrier to a more widespread use of scientific workflow systems has been complexity. Although most systems provide visual programming interfaces, assembling workflows requires deep knowledge of the underlying modules and libraries. This often makes it hard for domain scientists to create workflows and steer the data exploration process. An important goal of our research is to eliminate, or at least reduce, this barrier. VisTrails already represents a significant step towards this goal. The existing facilities for scalable parameter exploration and workflow re-use give domain scientists a high degree of flexibility to steer their own investigations. There are, however, several directions we intend to pursue in future work to further simplify the exploration process. In particular, since the system records all user interactions, it may be possible to identify in the version tree patterns left behind by experts. An intriguing avenue for further research is to try to extract these patterns, and use the derived knowledge to help other users create new workflows and/or solve similar problems.

Acknowledgments. Joint work with our students: Erik Anderson, Steven P. Callahan, Emanuele Santos, Carlos Scheidegger, and Huy T. Vo. This work is funded by NSF, DOE, IBM, and ARO.

Figure 1: A snapshot of the VisTrails history management interface is shown on the left. Each node in the vistrail history tree corresponds to a dataflow version that differs from its parent by changes to parameters or modules. This tree represents the trial-and-error process followed to generate insightful visualizations, two of which are shown on the right.
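
The action-based version tree and workflow diff described above can be illustrated with a small sketch. The Python code below is illustrative only, not the VisTrails implementation; the action kinds and payloads are hypothetical.

```python
# A minimal sketch of an action-based version tree: each node stores only the
# action that transformed its parent, a version is reconstructed by replaying
# actions from the root, and a diff is a set difference of action sequences.

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Action:
    kind: str       # e.g. "add_module", "set_parameter" (illustrative names)
    payload: tuple  # hashable details of the change, e.g. (("alpha", 0.5),)

@dataclass
class Node:
    parent: Optional["Node"] = None
    action: Optional[Action] = None   # action that produced this version

def actions_to_root(node: Node):
    """Sequence of actions from the root down to this version (replay order)."""
    chain = []
    while node.parent is not None:
        chain.append(node.action)
        node = node.parent
    return list(reversed(chain))

def workflow_diff(a: Node, b: Node):
    """Difference between two versions as set differences of their actions."""
    sa, sb = set(actions_to_root(a)), set(actions_to_root(b))
    return sa - sb, sb - sa

# Two versions branching from a common parent differ only in one parameter.
root = Node()
v1 = Node(parent=root, action=Action("add_module", (("name", "Reader"),)))
v2 = Node(parent=v1, action=Action("set_parameter", (("alpha", 0.5),)))
v3 = Node(parent=v1, action=Action("set_parameter", (("alpha", 0.9),)))
print(workflow_diff(v2, v3))
```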


Dennis Gannon

Workflow Workshop Issues Statement

Dennis Gannon
Indiana University

Many applications involving scientific workflow are well satisfied by conventional tools. However, there is a class of applications that stress many system capabilities to the limit. In the area of severe storm prediction, or any other disaster-modeling/prediction scenario, the workflows must respond in near real time to event streams from sensors. The occurrence of very specific sequences of events can cause any number of possible workflow scenarios to be enacted. For example, the detection of a pattern of vorticity over a particular plowed field can be matched to historical records. The result may be a concurrence of event triggers associated with patterns that can invoke long-dormant workflows. These dormant workflows may be only partially completed templates that must be filled out based on current conditions. Or they may be only part of a larger solution that must be pieced together without human intervention.

Based on our experience with these workflows, here is what we see as the fundamental requirements for the next generation of scientific workflow tools:

• Many e-science tools require a workflow to be executed from an interactive user session. This can be valuable, but the cases described above cannot be done with this constraint. The workflows may need to be very long running, possibly persisting for months between events.

• Recovery from failure and exception handling are critical. This is especially true in the case of workflows composed from widely distributed services or workflows that depend upon external resource availability and dynamic allocation.

• Workflows must be adaptive: changing conditions, such as an evolving pattern of requirements (for example, a storm has greatly intensified) or scheduled computing or data resources that are no longer available, may require a workflow to adapt and select alternative strategies to reach its goals. This requires expressiveness in the workflow language that is not well supported in some systems.

• Scientific workflows must also be repeatable, and all intermediate data products should be cataloged and saved. The workflow system must be deeply tied into the process of generating metadata and a history of its own actions that is of sufficient detail to generate provenance. It should be possible to replay a workflow from any intermediate step.

• Many scientific applications have extremely complex parameter specifications where a change to a parameter used in one component may require a change to a parameter used in a downstream component, or a single application component parameter must be propagated to multiple workflow components. The result is a complex web of initialization dependencies that are hard to factor into the workflow specification.

Workflow tools and languages are like programming languages, and there will never be just one. Different tasks require different tools. However, we feel it is essential that our workflow tools are able to interoperate at some level. For example, a BPEL workflow instance is also a web service instance; hence it can be used as a component in any workflow system that allows web services as components. At the other end, it would be nice if there were a common workflow runtime system or intermediate form. This would allow a workflow to be designed with one system, enacted and monitored with another system, and, if needed, replayed with a third tool. Again, for workflows based on web service components, BPEL is a possible choice for the standard intermediate, but other solutions may also be possible.


Karen Myers

User-centric Process Management

Karen L. Myers
Artificial Intelligence Center, SRI International

Menlo Park, CA [email protected]

Process management tools hold much promise for facilitating the automation of complex tasks in a broad range of domains including business workflow, military operations, device control (e.g., robots, satellite networks), and scientific data analysis. For many of these application areas, the objective is not to provide full automation, but rather to support a user in formulating, executing and adapting processes. Several factors can dictate the need for human-in-the-loop process management tools. In some cases, the development of the background knowledge required for full automation would be prohibitively expensive. In others, humans may choose to play an active role in process management out of a desire to retain overall control, or to understand what is being done and why.

Over the past several years, we have been developing capabilities to support the vision of user-centric process management. One thrust in this area has been to develop a process management system that can take advice from a user to direct both process generation and process execution. A second thrust has been the development of systems and representations for complex processes, as exemplified by the SPARK execution environment (http://www.ai.sri.com/~spark).

Our model of advice for process generation is motivated by the fact that complex tasks generally admit a broad range of solutions, which can vary substantially in their designs. For example, it may be possible to complete a data analysis task quickly by accessing extremely powerful computational resources at significant cost; in contrast, an alternative solution may take substantially longer but require less expensive resources. Our advice language enables a user to specify preferences over solution features of this type, thus providing the ability to bias process generation towards solutions that are customized to individual needs. Another type of advice is a sketch that outlines key components of a process. With this form of advice, a user can specify the essential components of a process, and then have the system fill in the details around it to produce a complete and correct set of activities for accomplishing a given task.

Our work has also considered advice at process execution time. One class of execution-time advice focuses on preferences for runtime decisions that the system can make. For example, the user could indicate that usage of a certain resource should not exceed a given threshold. A second class sets boundaries on decisions that the system is allowed to make without user involvement. For example, a user could exploit this capability within wet lab experiments to designate specific points of user interaction for varying a protocol, before settling on a final workflow to be used in a "batch" process.
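
As a rough illustration of the two classes of execution-time advice described above (preferences over runtime decisions and boundaries on autonomous decisions), here is a small Python sketch. The advice functions, field names, and thresholds are invented for illustration; this is not SRI's advice language.

```python
# A minimal sketch of execution-time advice: a preference that caps resource
# usage, and a boundary that withholds certain decisions from the system.

def advice_cpu_cap(decision, state, max_cpu_hours=100):
    """Preference: reject decisions whose projected CPU usage exceeds the cap."""
    return state["cpu_hours_used"] + decision["cpu_hours"] <= max_cpu_hours

def advice_require_user(decision):
    """Boundary: decisions tagged as protocol changes need user approval."""
    return not decision.get("changes_protocol", False)

def system_may_proceed(decision, state, preferences, boundaries):
    return (all(p(decision, state) for p in preferences)
            and all(b(decision) for b in boundaries))

state = {"cpu_hours_used": 80}
decision = {"cpu_hours": 15, "changes_protocol": False}
print(system_may_proceed(decision, state,
                         preferences=[advice_cpu_cap],
                         boundaries=[advice_require_user]))   # True
```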

Our work on advisability focuses on user directability of process management. Many other technical problems will need to be addressed to complete the vision of a user-centric process management system. Areas for future consideration include explanation of system behavior, proactive assistance to improve processes, and better techniques for acquiring and adapting process knowledge.


Walt Scacchi

Discovering, Modeling, Analyzing, and Re-enacting Scientific Work Processes and Practices:

Research Statement

Walt Scacchi
Institute for Software Research
University of California, Irvine
Irvine, CA 92697-3455 USA

http://www.ics.uci.edu/~wscacchi

Ongoing Research
I have been engaged in empirical studies of technical development and scientific research processes, practices, and community projects. My recent research has focused on the discovery of work processes and practices associated with the continuing development of free/open source software development (F/OSSD) systems in domains such as Internet infrastructure, computer games, academic software design, X-ray astronomy, and grid computing. These investigations give rise to informal, semi-structured, formal (computational and re-enactable), and ethnographic models of the situated practices, informal workflows, and recurring processes that characterize the observed work. These models can be analyzed, simulated, visualized, re-enacted, and redesigned (optimized) using a variety of automated tools and techniques. Such activities enable the models to be validated, reused, redeployed, and continuously improved in different technical or scientific work settings.

Upcoming Challenges in Scientific Workflows
Much of the recent work examining the design of software systems supporting anticipated scientific computing workflows is surprising in a number of ways. For example, proposals for developing and deploying virtual organizations enabled by scientific grids, grid services, or HPC display comparatively little basis in organization science, computer-supported cooperative work, or social informatics. As a result, such well-intentioned prescriptions may envision usage schemes or scenarios that are highly rational, but otherwise ungrounded and thus unlikely to succeed in fitting into or transforming extant scientific work situations. Such an outcome is, however, avoidable. Design of scientific workflows and processes that intend or assume transformation of established work practices can be informed by empirical study that first seeks to comparatively understand and describe existing "as-is" practices, workflows, and processes in a target setting. In turn, when proposing alternative workflow or "to-be" process designs, a comparable effort also needs to be directed to enabling the people whose workflow is to be transformed to participate in articulating and enacting the "here-to-there" progressive and incremental transformations of their practices, workflows and processes. While this might seem cumbersome, it has been shown through a number of case studies of IT-based organizational transformation to successfully yield new, sustainable workflows, as well as cost-effective adoption and integration of transformative information technologies.


Ilkay Altintas

New Challenges for User-Oriented Scientific Workflows
Ilkay Altintas

San Diego Supercomputer Center, [email protected]

1 Extended Requirements for Scientific Workflows in Kepler

Kepler [http://kepler-project.org] is a cross-project collaboration to develop a scientific workflow system for multiple disciplines. Kepler provides a workflow development, management and execution environment intended to support workflows ranging from local analytical pipelines to large-scale distributed and data-intensive scientific workflows. Kepler scientific workflows can be defined as customizable and extensible processes that combine data and computational processes into a configurable, structured set of steps to implement automated solutions to a scientific problem. Kepler provides a user interface for scientific workflow operations as well as a batch execution infrastructure that can be instantiated without the user interface. Building upon the open-source Ptolemy II [http://ptolemy.berkeley.edu/ptolemyII/] software, Kepler implements functionality for different technologies by adding components (a.k.a. actors) that perform operations including data access, remote job execution, generic imaging, and statistical and mathematical functions. Within Kepler, actor-oriented modeling provides a way to link the functionality of different actors as a combined analytical set of steps, allowing the data to flow from one step to another and transforming it at each step (a conceptual sketch of this style of composition follows the requirements list below). Along with the workflow design and execution features, Kepler has ongoing research on a number of built-in system functionalities, including support for single sign-on GSI-based authentication and authorization; semantic annotation of actors, types, and workflows; creating, publishing, and loading plug-ins as archives using the Vergil user interface; conceptually building hierarchical workflows by abstracting sub-workflows as an aggregate of executable steps; and documenting entities of different granularities on-the-fly. In spite of the fact that development in Kepler is motivated by application pull from different scientific disciplines and projects, scientific workflows have common requirements. We have observed an extension of these requirements as the technology evolves:

• Ability to access heterogeneous data and computational resources, requiring secure and seamless access to multiple virtual organizations using multiple roles; frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources

• An extensible and customizable graphical user interface for scientists from different scientific domains; ability to link to different domain knowledge, and to invoke multiple applications and analysis tools

• Ability to support computational experiment creation, execution, sharing, reuse and provenance

• Ability to track provenance of workflow design, execution, and intermediate and final results; efficient failure recovery and smart re-runs

• Ability to support the full scientific process by means to use and control instruments, networks and observatories in observing steps, to scientifically and statistically analyze and control the data collected by the observing steps, and to set up simulations as testbeds for possible observatories
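
The sketch below illustrates, in the simplest possible terms, the actor-oriented composition style described above: data tokens flow through a chain of actors, each of which transforms them. It is a conceptual Python analogy only; Kepler actors are Java components built on Ptolemy II, and the actor names here are invented.

```python
# A conceptual sketch of actor-oriented dataflow composition (illustrative
# only): each actor transforms the token it receives, and a workflow is an
# ordered composition of actors.

class Actor:
    def fire(self, token):
        raise NotImplementedError

class ReadValues(Actor):          # data access step (hypothetical actor)
    def __init__(self, values):
        self.values = values
    def fire(self, _):
        return list(self.values)

class Scale(Actor):               # transformation step (hypothetical actor)
    def __init__(self, factor):
        self.factor = factor
    def fire(self, values):
        return [v * self.factor for v in values]

class Mean(Actor):                # analysis step (hypothetical actor)
    def fire(self, values):
        return sum(values) / len(values)

def run_workflow(actors, token=None):
    for actor in actors:
        token = actor.fire(token)   # token flows from one actor to the next
    return token

print(run_workflow([ReadValues([1, 2, 3, 4]), Scale(10.0), Mean()]))   # 25.0
```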

2 Upcoming Challenges for More Usable Scientific Workflows

To support the extended requirements mentioned in the previous section, the scientific workflow research community faces challenges in areas including usability, workflow design methodologies, provenance tracking, new computational models to support dynamic and adaptive changes in running workflows, failure recovery, re-running workflows with minor changes, authentication and authorization, and the incorporation of semantic Grid and semantic annotation technologies. Among these challenges, usability and provenance tracking require special attention, as these features have received less attention in previous years and are vital both for the acceptance of scientific workflow technologies by wider scientific communities and for the success of all the other features in a scientific workflow.


Roger Barga

Roger S. Barga

Database Group
Microsoft
[email protected]

On-Going Research Project

Workflow Execution Provenance. This project is an effort to define an extensible model for execution provenance related to workflow execution and to implement mechanisms that automatically generate provenance data from a commercial workflow processing engine. We have three initial research objectives. The first objective is to explore the nature of execution provenance, ranging from the specific steps taken during workflow enactment to produce a result, to generating a record of services invoked during execution and the parameters used, to recording deviations from the given workflow model. Since the intention is that this provenance is a machine-readable artifact, we are designing an XML format that reflects our model. The second objective is to demonstrate the practicality of generating this provenance data automatically from a workflow enactment engine. Dynamically creating a workflow execution trace, in particular one that can be re-executed, is a challenge that depends largely on the capabilities of the runtime in which the workflow was executed. The third objective is to define management procedures to efficiently store workflow provenance, and to construct algorithms to query and reason over this data.
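
As a rough illustration of the kind of machine-readable record the project targets, the sketch below emits a small XML provenance document listing invoked services, their parameters, and any deviation from the workflow model. The element and attribute names are assumptions for illustration, not the project's actual schema.

```python
# A minimal sketch, under assumed element names, of an XML execution
# provenance record: one <invocation> per service call, with parameters
# and an optional <deviation> note.

import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def provenance_record(workflow_id, steps):
    root = ET.Element("workflowProvenance", workflow=workflow_id)
    for step in steps:
        inv = ET.SubElement(root, "invocation",
                            service=step["service"], started=step["started"])
        for name, value in step["parameters"].items():
            ET.SubElement(inv, "parameter", name=name).text = str(value)
        if step.get("deviation"):
            ET.SubElement(inv, "deviation").text = step["deviation"]
    return ET.tostring(root, encoding="unicode")

print(provenance_record("wf-42", [{
    "service": "AlignSequences",                      # hypothetical service
    "started": datetime.now(timezone.utc).isoformat(),
    "parameters": {"matrix": "BLOSUM62", "gapPenalty": 11},
}]))
```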

Upcoming Challenges in Scientific Workflows

Identify and Formalize Workflow Patterns and Common Language Constructs. Although many research groups actively use workflow management systems to carry out experiments, there is little consensus as to what the essential ingredients of a workflow enactment engine or a workflow specification language should be. Today our community has a number of workflow management systems, based on different paradigms, that use a large variety of concepts. We believe the scientific workflow community should work to identify common patterns and control-flow aspects of scientific workflows, which would help in understanding fundamental properties of both the workflow enactment engine and the specification language. This would not only aid those developing a given workflow (experiment) with a workflow server already deployed in their organization, but also help those developing a new workflow engine, and hopefully facilitate the exchange of workflow (experiment) designs between research institutions.

Community Standards for Collaborative Provenance and Annotations. Today, scientists face many of the same challenges and opportunities found in enterprise computing, namely integrating distributed and heterogeneous resources. Scientists no longer use just a single machine, or even a single cluster of machines, or a single source of data. Research collaborations are becoming more and more geographically dispersed, and often exploit heterogeneous tools, compare data from different sources, and use machines distributed across several institutions in the world. As the number of scientific resources available on the internet increases, scientists will increasingly rely on web technology to perform in silico experiments. In this computing environment, the generation and collection of experiment provenance is now distributed and cooperative. Creating a meaningful verification of an experiment will require that all of the various distributed resources (participants) cooperate in order to produce a complete provenance trace. This also requires standards for representing provenance and procedures for integrating these records into a complete description of an experiment, as well as standards for annotations such as the version number of a data set or a web service.


Yolanda Gil

Embedding Workflows in the Scientific Discovery Process

Yolanda Gil
USC/Information Sciences Institute, [email protected]

Scientists today have the ability to assemble complex workflows by drawing from data and models provided by other scientists. Our prior research has demonstrated the use of Artificial Intelligence techniques to create semantic descriptions of both data and components and use them to assist users in creating valid scientific workflows or, alternatively, to create workflows fully automatically (1). In an ongoing collaboration between ISI's Intelligent Systems Division and its Center for Grid Technologies, we have developed an approach to the creation of scientific workflows that includes the Wings workflow generation system and the Pegasus workflow mapping system. The Wings/Pegasus framework is designed to support reusability of workflows, management of the complexity of the creation of large workflows, and automation of non-experiment-critical computational details. In Wings/Pegasus we distinguish between workflow templates, workflow instances, and executable workflows. Workflow templates are data-independent and execution-independent, capture in compact forms the parallel processing of data, and enable automated reasoning to derive domain-relevant metadata properties of new data products. Workflow instances bind workflow templates to data, and therefore contain specifications of computations and their requirements. Executable workflows are formed by mapping workflow instances to computation and storage resources for execution. Workflow templates are highly reusable, and should be created to capture proper scientific methodology. We have used the Wings/Pegasus framework to create workflows for data-intensive natural language processing and for earthquake simulations that expand workflow templates with several dozen types of computations into workflow instances with thousands of computations that are then mapped onto distributed execution environments.
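
The three levels distinguished above can be sketched schematically as follows; the Python classes and example names are illustrative placeholders, not the Wings/Pegasus data model.

```python
# A schematic sketch of the template / instance / executable distinction:
# a data- and execution-independent template, an instance that binds the
# template to data, and an executable workflow that binds it to resources.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class WorkflowTemplate:
    name: str
    component_types: List[str]      # abstract computation types, no data

@dataclass
class WorkflowInstance:
    template: WorkflowTemplate
    data_bindings: Dict[str, str]   # input name -> dataset identifier

@dataclass
class ExecutableWorkflow:
    instance: WorkflowInstance
    site_bindings: Dict[str, str]   # component type -> execution site

# Hypothetical example of progressively binding a template for execution.
template = WorkflowTemplate("simulation-study", ["Preprocess", "Simulate", "Aggregate"])
instance = WorkflowInstance(template, {"input_catalog": "catalog-v1"})
executable = ExecutableWorkflow(instance, {c: "hpc-cluster" for c in template.component_types})
print(executable)
```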

Research in the coming decade needs to embed workflows into scientific research processes:

• Embedding workflows into their scientific context. Much like workflows combine individual computations into a meaningful end-to-end analysis, we need to develop techniques to combine individual workflows into meaningful collections of workflows that are designed to investigate a scientific question. In order to investigate how two alternative models compare, several workflows need to be created and executed with a range of standard data sets. We need to investigate how to support scientists in these kinds of higher-level processes. This requires moving the focus away from individual workflows that offer isolated data points out of any context, and focusing instead on higher-level constructs that look more like comprehensive answers to meaningful scientific questions.

• Embedding workflows into the scientific discovery process. Data mining components and automated hypothesis formulation and proposal components should be routine in any workflow environment. Machine learning research offers a wide range of techniques for finding regularities in data, and some have already made scientific contributions such as proposing new categorizations of stellar objects and molecular properties. These approaches need to be integrated with interactive processes that support the steering of workflows being created.

• Embedding workflows into scientific publication processes. Every scientific article that proposes a new model or presents new results should be accompanied by the data and the workflows used to support those conclusions. Any scientific contribution should be described not only textually but also computationally. Some scientific publications already support the on-line publication of data and/or algorithms. Some scientific data repositories already support the publication of metadata to describe data provenance. A tighter integration of these techniques will be important, and workflows are a central component of this integration.

One could imagine that hyperworkflows, which group collections of meaningful workflows and embed workflows within the overall scientific analysis and discovery process, could significantly accelerate scientific progress. Artificial Intelligence can provide crucial components to this vision, including expressive representations of scientific descriptions, reasoning and automation capabilities, and machine learning techniques for data mining and hypothesis formation.

(1) www.isi.edu/ikcap/cat, www.isi.edu/ikcap/pegasus, pegasus.isi.edu.


Alexander Gray

Semi-Automation of Scientific Data Analysis

Alexander Gray, Georgia Institute of Technology

The scientific data analysis of the future. Scientific data analysis, as exemplified recently in fields like astrophysics and molecular biology, is exhibiting trends which pose significant difficulties:

• Datasets are becoming massive, numerous, and complex. In addition to Moore's law for CPU speeds, there is an analogous exponential growth law for dataset sizes, with an even larger exponent, causing a widening gap with respect to what we can manage easily. Datasets are appearing from an increasing number of more sensitive instruments/sources, and the cross-analysis of these datasets is opening up new opportunities. Thus datasets are becoming more high-dimensional and complex.

• Scientific data analysis is requiring expertise across fields. Extracting patterns from these datasets increasingly requires state-of-the-art statistics, machine learning, applied mathematics, algorithms, computer systems, and graphics. The often large divide between the domain sciences and this expertise means that unnecessary simplifications are likely to be used.

Scientific data analysis workflow. The process can be described in terms of the following steps:
1. Understand the problem and task
2. Specify a dataset
3. Gain insight into the data via exploratory data analysis
4. Define the inference task (e.g. clustering, outlier detection)
5. Define the statistical model to be used (e.g. mixture of Gaussians)
6. Define the statistical methodology to be used (e.g. maximum likelihood)
7. Derive the parameter estimation (training) algorithm (e.g. EM algorithms)
8. Derive a scalable version of the training and prediction algorithms (e.g. kd-tree-based algorithms)
9. Implement and test the derived training and prediction algorithms (e.g. in Matlab or C)
10. Compare the algorithms with previously tried algorithms
11. Write up and present all these steps (e.g. in LaTeX)
12. Repeat from some previous point

This process may involve multiple distributed users, datasets, and computers. Note that all of these steps in general require deep knowledge, either of the domain problem, statistical (aka machine learning) theory and practice, or computational theory and practice.

The Algorithmica Project. We are developing Algorithmica, a system which automates steps 7-9 by encoding existing state-of-the-art statistical and computational expertise into templatized abstract schemas, and by utilizing techniques from AI search, computational logic, and computer algebra. The system extends a previous system called AutoBayes (see Gray, Fischer, Schumann, and Buntine, NIPS 2002), which demonstrated the automatic derivation of new EM algorithms customized for the user's model and was capable of deriving the results of recent machine learning research papers automatically. We are going beyond the capabilities of AutoBayes in four concrete ways, at the moment, to allow: a) temporal models such as hidden Markov models, b) robust estimators, c) massive-dataset algorithms based on kd-trees, d) injection of user heuristics into the search process. Our lab's work is motivated by our collaborations with astrophysicists (Sloan Digital Sky Survey) and systems biologists (Skolnick lab). Results using our algorithms have appeared in Science and Nature. A medium-term goal is to develop versions of Algorithmica specialized for our scientist collaborators.
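
For concreteness, the sketch below shows the kind of artifact that steps 7-9 target for the running example (a mixture of Gaussians fit by maximum likelihood via EM): a vectorized NumPy estimation routine. Algorithmica/AutoBayes aim to derive such code automatically from a model specification; this version is hand-written for illustration only.

```python
# A hand-written sketch of an EM parameter-estimation routine for a 1-D
# mixture of Gaussians: the E-step computes responsibilities, the M-step
# re-estimates means, variances, and mixing weights.

import numpy as np

def em_gmm_1d(x, k=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n = x.size
    mu = rng.choice(x, size=k, replace=False)     # initial means from data
    var = np.full(k, x.var())                     # initial variances
    w = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(iters):
        # E-step: responsibilities r[i, j] = P(component j | x_i)
        d = x[:, None] - mu[None, :]
        logp = np.log(w) - 0.5 * (d**2 / var + np.log(2 * np.pi * var))
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exp
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted re-estimation of the parameters
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk
        w = nk / n
    return mu, var, w

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 500), rng.normal(5, 1, 500)])
print(em_gmm_1d(x))   # estimated means should come out near 0 and 5
```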

Future directions. Though more far-reaching, in principle steps 2-6 and 10-12 can also be automated to various degrees. We are concerned with the steps which require deep statistical and computational expertise. We welcome collaborators who are concerned with other aspects of creating an eventual scientific data analysis assistant, such as workflow definition and knowledge representation, human-computer interaction and visualization issues, parallel/distributed computing and communication issues, scientific domain knowledge, and database issues.


Jim Hendler

Beyond the Workflow
Jim Hendler
Department of Computer Science
University of Maryland

In a vision for Cyberinfrastructure, NSF presented an animated Web site that showed an astronomer sitting down at a coffee shop, opening her laptop, and running a simulation of galaxies colliding. In a call for comments on the vision, I responded in part with:

I would like to slightly expand the vignette in the cyberinfrastructure web site, starting with an explanation of why the astronomer was in the coffee shop that morning. You see, she had just pulled an all-nighter doing the work she needed to do to get the computation ready to run. First, she had to spend a couple of hours using Google to see if she could find some programs to use for her simulation models. She then spent a couple of hours searching through the appendices of papers she found on the open-source physics archives to find the datasets that represented the particular galaxies she wanted to explore (and which, of course, cannot be searched for in Google, which has no search capability against data). At that point, she had to start chatting with colleagues in Japan (who were just waking up at that time of night), because the datasets she had found were not in the format she needed for the program she needed, so she had to find a program that she could use to convert formats. Towards dawn, she had all the components she needed; unfortunately, she was still writing some glue code that would create the workflow she needed to execute the whole thing. Finally, she called friends at a half dozen computer centers as they arrived in the morning so she could get all the passwords and keys that would be needed so her code could run in the distributed system.

Some of these issues have been explored in projects in the UK's e-science program (and the new EU "semantic grid" RFP brings some of these issues forward). This work has shown that workflows must be concerned not only with the computation, but also with the issues of service discovery and composition, technology for data integration and organization, and the various issues of security and access within the virtual organizations necessary for Grid computing. (The modeling of these virtual organizations is also itself an important research challenge.)

Too often, the workflow issues are viewed primarily from a computational point of view, ignoring the fact that there is an "information management" perspective on these processes that is just as important. (In fact, it is sometimes more important, as the social issues involved in sharing data and processing in the life sciences, for example, often militate against any sort of automated assistance which cannot include provenance and other "credit sharing" mechanisms.) Taking a holistic view of the "life cycle" of the scientist, including everything from data gathering to publication (and including issues such as advising, collaborating, etc.), is crucial to making sure that our solutions to scientific workflow problems are actually useful to scientists. Otherwise, they will not be used.

18

Craig Knoblock

Mixed-Initiative Construction of Workflows

Craig A. Knoblock
University of Southern California

Research Statement

In previous work, we developed a constraint-integration framework, called Heracles, which was designed for implementing mixed-initiative, multi-source information workflows [KMA+01]. This system has been applied to a variety of useful applications, including travel planning [ABK+02], visitor scheduling, and geospatial data integration. However, each of these applications of Heracles required a significant effort to build, and each successful application invariably generated requests for changes, additional sources, and new features. Ideally, we would like to allow the users of the system to create, update, and improve their own workflows.

To address this problem, we have been working on an approach to building new information workflows and modifying existing workflows using a mixed-initiative approach. The basic idea is to construct a mixed-initiative application in Heracles that interactively constructs mixed-initiative workflows. Heracles is used both for the authoring environment and for the resulting application, so the user would see a single uniform interface for the entire system. In order to support the authoring, we address the issues of how to identify the relevant sources, how to link the sources together, and how to relate the specific data instances from the various sources. Initial work on this topic is described in [KST05].

As more and more information becomes available online and more tools become available, the problem of dynamically defining new workflows will become increasingly important. More specifically, some of the critical challenges that need to be addressed to support the definition of new workflows include:

• Tools that allow end-users (e.g., scientists) to define and maintain their own workflows that include their own data sources.

• Techniques for rapidly constructing models of new sources or services so that they can be quickly and correctly integrated.

• Systems for dynamically integrating data across multiple data sources (i.e., databases or web services) that were not designed to work together (a small sketch of this challenge follows the list).

• Approaches to recognize and resolve inconsistencies at the level of the data retrieved from multiple sources.
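
As a minimal illustration of the integration challenge referenced above, the sketch below mediates between two hypothetical sources that expose the same kind of record under different schemas; the source names, fields, and tolerance are invented for illustration and are not Heracles components.

```python
# Hypothetical illustration of dynamically integrating two sources that were
# not designed to work together: each wrapper maps a source-specific record
# into a shared mediated schema, and a simple resolver flags inconsistencies.
from typing import Dict, List

def wrap_gazetteer(rec: Dict) -> Dict:
    # source A uses ("placename", "lat", "lon")
    return {"name": rec["placename"], "latitude": rec["lat"], "longitude": rec["lon"]}

def wrap_web_service(rec: Dict) -> Dict:
    # source B uses ("title", "coords") with coords encoded as "lat,lon"
    lat, lon = (float(v) for v in rec["coords"].split(","))
    return {"name": rec["title"], "latitude": lat, "longitude": lon}

def integrate(a: List[Dict], b: List[Dict], tol: float = 0.01) -> List[Dict]:
    """Merge records by name; flag coordinate disagreements beyond tol."""
    merged = {r["name"]: r for r in map(wrap_gazetteer, a)}
    for r in map(wrap_web_service, b):
        prev = merged.setdefault(r["name"], r)
        if abs(prev["latitude"] - r["latitude"]) > tol or \
           abs(prev["longitude"] - r["longitude"]) > tol:
            prev["conflict"] = True   # leave resolution to the user (mixed initiative)
    return list(merged.values())

print(integrate([{"placename": "Marina del Rey", "lat": 33.98, "lon": -118.45}],
                [{"title": "Marina del Rey", "coords": "33.97,-118.44"}]))
```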

References

[ABK+02] Jose Luis Ambite, Greg Barish, Craig A. Knoblock, Maria Muslea, Jean Oh, and Steven Minton. Getting from here to there: Interactive planning and agent execution for optimizing travel. In Proceedings of the Fourteenth Conference on Innovative Applications of Artificial Intelligence (IAAI-2002), pages 862–869, AAAI Press, Menlo Park, CA, 2002.

[KMA+01] Craig A. Knoblock, Steven Minton, Jose Luis Ambite, Maria Muslea, Jean Oh, and Martin Frank. Mixed-initiative, multi-source information assistants. In Proceedings of the World Wide Web Conference, pages 697–707, ACM Press, New York, NY, May 2001.

[KST05] Craig A. Knoblock, Pedro Szekely, and Rattapoom Tuchinda. A mixed-initiative system for building mixed-initiative systems. In Proceedings of the AAAI Fall Symposium on Mixed-Initiative Problem-Solving Assistants, 2005.

19

Luc Moreau

Provenance in heterogeneous workflows: towards a standard approach

Luc Moreau
University of Southampton
[email protected]

Very large scale computations are now becoming routinely used as a methodology to undertake scientific research: success stories abound in many domains, including physics (www.griphyn.org), bioinformatics (www.mygrid.org.uk), engineering (www.geodise.org) and geographical sciences (www.earthsystemgrid.org). In this context, 'provenance systems' are being regarded as the equivalent of the scientist's logbook for in silico experimentation. By capturing the documentation of the process that led to some data, the provenance of such data can be determined by issuing user-specific queries over such documentation. The outcome of such provenance queries can be used by users to verify how results were achieved or to reproduce them: this is particularly crucial when results can be obtained only by in silico means and no other validation in the physical world is possible.

From an engineering viewpoint, systems are becoming more and more complex and are typically assembled from multiple heterogeneous technologies, each of them used in different parts of the computation. For instance, in the LCG Atlas experiment (atlas.web.cern.ch), Athena and VDT coexist, each capable, in their own way, of specifying components, compositions and their execution. Likewise, in our HPDC'05 paper, we introduce a bioinformatics application that consists of a mix of VDL workflows, shell scripts, and Web Services. Taking bioinformatics again, in theory, an application could be composed of workflows expressed in Kepler, myGrid Scufl and VDL. Hence, the question of inter-operability arises: how can we ensure that each sub-system can individually provide documentation of its execution so that the provenance of data can seamlessly be queried across such documentation in a technology-independent manner?

The previous discussion indicates that there is a need for inter-operable solutions, by which documentation of actual execution can be made available in the long term, so that the provenance of data can be retrieved and tools can be built in order to analyse how results were produced. Different facets of this topic are being investigated in a number of projects.
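
As a minimal sketch of what such technology-independent documentation and querying could look like (assuming nothing about the actual interfaces of the projects listed below), each sub-system records a neutral assertion of the form "output produced by process from inputs" in a shared store, and a provenance query walks those assertions backwards from a data item. The process names and file names are invented.

```python
# A minimal, hypothetical sketch: each sub-system (a Scufl workflow, a shell
# script, a Web Service) records a technology-neutral assertion in a shared
# store, and a provenance query traverses those assertions from a data item.
from collections import defaultdict

store = []  # list of (output_id, process_description, [input_ids])

def record(output_id, process, inputs):
    """Called by any component, regardless of the technology that ran it."""
    store.append((output_id, process, list(inputs)))

def provenance(data_id):
    """Return every process/input that (transitively) led to data_id."""
    by_output = defaultdict(list)
    for out, proc, ins in store:
        by_output[out].append((proc, ins))
    trace, frontier = [], [data_id]
    while frontier:
        current = frontier.pop()
        for proc, ins in by_output.get(current, []):
            trace.append((current, proc, ins))
            frontier.extend(ins)
    return trace

# Heterogeneous components documenting one run:
record("aligned.fa", "shell: clustalw input.fa", ["input.fa"])
record("tree.nwk", "Scufl workflow: build-phylogeny", ["aligned.fa"])
record("figure.svg", "Web service: render-tree", ["tree.nwk"])
print(provenance("figure.svg"))
```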

[PASOA] Provenance Aware Service Oriented Architectures (www.pasoa.org)
[EU Provenance] www.gridprovenance.org
[SOCA] http://twiki.grimoires.org/bin/view/Soca/WebHome

20

Amit Sheth

Semantic and autonomic approaches for workflows in life science research and health care practice
Amit Sheth, LSDIS lab, University of Georgia

Workflows have already been used to effectively automate repeatable tasks. What are the unique requirements for scientific workflows? We provide some observations based on collaborations in two domains:

• The first collaboration is in the area of life sciences. It involves a workflow to manage data from high-throughput glycoproteomics experiments, being developed as part of the Integrated Technology Resource (NCRR) focusing on biomedical glycomics.

• The second collaboration is in the area of health care. It involves the study of requirements for a workflow (not yet implemented) to support a heart failure clinical pathway.

Based on these studies, we focus our discussion on adaptive workflows to represent and support the dynamic nature of scientific protocols. The core semantics of a workflow framework should be generic and adaptable to the disparate nature of tasks that characterize scientific processes, namely data transfer, format conversion, filtering, tracking, merging or mere pruning of an available search space. As the key enabler in achieving adaptive capability, we identify the need for comprehensive modeling of provenance and use of semantics to address research issues such as heterogeneity and adaptability, with associated support for optimization.

Our collaboration in the life sciences focuses on creating workflows for high-throughput glycoproteomics experimentation as part of an overall cancer research program. The core problem is creating highly configurable scientific workflows that have the ability to be dynamically configured based on user input. We are investigating the formal modeling of the complete experimental lifecycle in glycoproteomics. Our initial attempt at this is available as ProPreO. Using this ontology, we hope to achieve the following tasks: (a) semantic annotation of the capabilities and interfaces of computing resources, (b) using the semantic annotation for semi-automated discovery of the resources, and (c) using semantic mappings to integrate the resources. The underlying technology integrating the resources will be semantic Web services powered by the recent W3C member submission WSDL-S (identified as a key input to the W3C working group on Semantic Annotation of WSDL). We are also investigating an ontology-based provenance framework that would allow the experimental data to be semantically interpreted in the context of the workflow that generated it. This would pave the way for software applications to manage, interpret and reason over the experimental data, which would, in the future, allow knowledge discovery over data created in independently run experiments at different research centers at different times.
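
A minimal sketch of tasks (a) and (b) follows: resources are annotated with ontology concepts and discovered by matching a requested data concept against those annotations. The concept names, service names, and the toy subclass table are invented placeholders, not actual ProPreO terms or WSDL-S annotations.

```python
# A minimal sketch of (a) semantic annotation and (b) semi-automated discovery.
# All concept and service names are invented placeholders.
from typing import Dict, List

# (a) semantic annotation of resource capabilities and interfaces
RESOURCES: List[Dict] = [
    {"name": "ms-peak-picker", "input": "ex:RawMassSpectrum", "output": "ex:PeakList"},
    {"name": "glycan-id-service", "input": "ex:PeakList", "output": "ex:GlycanIdentification"},
]

# toy subclass relation standing in for ontology reasoning
SUBCLASS_OF = {"ex:AnnotatedPeakList": "ex:PeakList"}

def is_a(concept: str, target: str) -> bool:
    # walk up the (toy) subclass hierarchy
    while concept is not None:
        if concept == target:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

# (b) discovery: find services whose input concept accepts our data concept
def discover(data_concept: str) -> List[str]:
    return [r["name"] for r in RESOURCES if is_a(data_concept, r["input"])]

print(discover("ex:AnnotatedPeakList"))   # -> ['glycan-id-service']
```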

In the healthcare domain, the core research issue involves handling data volatility in the context of clinical care of patients. We are investigating strategies for associating large amounts of new information and knowledge, culled from various sources in heterogeneous formats (e.g., new drug advisories or results from clinical trials), with the focused need of a physician to follow a care pathway that optimally captures best practices. From the computer science perspective, we are investigating modeling clinical pathways as Autonomic Web Processes (AWPs). AWPs are a next-generation, highly adaptive workflow technology characterized by self-configuring, self-healing, self-optimizing and self-protecting properties. AWPs can be augmented with semantic Web techniques of ontology-supported automatic semantic annotation (including its constituent capability for disambiguation) and semantic association to manage the implications of volatile data for clinical pathway instances. A representative scenario involves showing how a new drug advisory is relevant to the clinical pathways of existing patients and optionally proposing modifications for consideration by a physician.

To summarize, from our experiences in the life sciences domain, the pertinent research issues include semantic modeling of the domain, annotation of tools and resources, creating configurable workflows and using provenance information associated with experimental data for analysis and knowledge discovery. In the health care domain, the research issues include evaluating the value of new, complex, and comprehensive technologies such as AWPs and the semantic Web for modeling adaptive clinical pathways.

For further reading, see the LSDIS lab projects: METEOR-S (Semantic Web Services & Processes), Bioinformatics for Glycan Expressions, and IntelliGen (a workflow in genomics research).

21

Ewa Deelman
NSF Workshop on Challenges of Scientific Workflows, April 2006

Ewa Deelman
Research Statement

Much of the workflow research I have been doing involves the development of the Pegasus system (Planning for Execution in Grids, pegasus.isi.edu) and applying it to a variety of application domains, including astronomy, gravitational-wave science, earthquake science, biology, and others.

Pegasus maps an abstract workflow, a form that describes the analysis and the data needed without identifying the resources needed to conduct the computations. The mapping process involves not only finding the appropriate resources for the tasks but may also include some workflow restructuring geared toward improving the performance of the overall workflow. In order to adapt to a dynamic execution environment, Pegasus may also map only portions of a workflow at a time.

Pegasus utilizes various information systems to find the available resources and the locations of data sets (possibly replicated in the environment). When possible, Pegasus reduces the workflow based on the available intermediate data products. This reduction is useful not only for improving the performance of the overall workflow but also for improving fault tolerance. When parts of the workflow fail, they can be subsequently remapped. Successful portions of the workflow are reduced and only the necessary computations are performed.
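
The reduction step can be pictured with a small sketch, ours rather than Pegasus code: working backwards from the requested outputs, a task is kept only if it must produce some file that is not already registered (for example, in a replica catalog). The task and file names are invented.

```python
# A sketch of abstract-workflow reduction in the spirit described above (not
# Pegasus source code): keep a task only if it must produce a missing file.
def reduce_workflow(tasks, available, requested):
    """tasks: {name: {"inputs": [...], "outputs": [...]}}
    available: set of files already registered (e.g., in a replica catalog)
    requested: files the user ultimately wants."""
    producer = {f: n for n, t in tasks.items() for f in t["outputs"]}
    keep, needed = set(), [f for f in requested if f not in available]
    while needed:
        f = needed.pop()
        n = producer.get(f)
        if n is None or n in keep:
            continue
        keep.add(n)
        # the task will run, so its missing inputs must be produced too
        needed.extend(i for i in tasks[n]["inputs"] if i not in available)
    return keep

tasks = {
    "extract": {"inputs": ["raw.dat"], "outputs": ["a.dat"]},
    "transform": {"inputs": ["a.dat"], "outputs": ["b.dat"]},
    "plot": {"inputs": ["b.dat"], "outputs": ["final.png"]},
}
# b.dat already exists, so only "plot" needs to run
print(reduce_workflow(tasks, available={"raw.dat", "b.dat"}, requested={"final.png"}))
```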

Pegasus can optimize the workflow based on the target execution systems, for example clustering short-running tasks together and running them in a master-slave mode on high-performance systems. Finally, Pegasus also automatically manages the new data products (both intermediate and final) by registering them in various data catalogs. With the help of the VDS kickstart code wrapper, it captures performance-based provenance information and saves it for future access.

Many research issues still need to be addressed in workflow management systems. Some issues touch upon improving workflow performance by developing better application and resource performance models, which in turn can help improve the planning process. The performance models are also necessary for accurate and cost-efficient resource provisioning.

More research needs to target fault tolerance in the planning and in the workflow execution process. Pegasus has some fault-tolerant capabilities; however, the issue of fault tolerance across workflow management systems is a greater one and involves a dialogue between the workflow composition, workflow planning, and workflow execution components.

Debugging is also a major issue, especially in environments such as the Grid, where errors are hard to detect and categorize. Additional complexity stems from the gap between what the user specified (possibly a very high-level analysis specification) and what is actually executed (very low-level, detailed directives to the Grid). Bridging this gap can be a significant challenge.

Finally, most of the workflow systems today involve a user specifying the workflow in its entirety and then the workflow management system bringing it to execution. Providing support for interactive workflows poses great challenges, since the interactions with the users need to be predictable in terms of time scale. Thus, real-time performance and QoS guarantees are becoming very important, and issues of resource provisioning are coming to the forefront.

"Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems", E.Deelman, et al. Scientific Programming Journal, 13 (3), 2005"Pegasus: Mapping Scientific Workflows onto the Grid ," E. Deelman, et al. Across Grids Conference2004, Nicosia, Cyprus, 2004

22

Community Process as Workflow

Ewa Deelman, Ian Foster, Carl Kesselman, Mike Wilde

A workflow is typically viewed as describing a single logical activity, specifying a set of operations to be performed and the conditions under which those operations can be scheduled. We argue here for a broader perspective, by which workflow captures the ensemble of activities performed by a community over an extended time period. Viewed from this perspective, a workflow may capture (at varying degrees of abstraction) details of activities that have been planned ("what I want to do"), activities that are currently being performed ("what I am doing"), and finally, past activities ("what I have done"). The figure below shows the relationship between these activities.

The ability to represent, modify and capture these different perspectives of workflow is important if one is interested not only in the product of an activity, but also in understanding the process by which the product was obtained. Such understanding is typically essential in scientific investigation, in which repeatability, provenance, and the communication of process are all crucial.

Thus, we argue that a research agenda for scientific workflow should address issues associated with managing this workflow lifecycle, such as:

• Representations that can capture future, current, and past activities at appropriate levels of abstraction (a small sketch of one such representation follows this list).

• Workflow execution architectures that support concepts of workflow lifecycle (e.g., versioning) in a comprehensive and integrated manner.

• Techniques for refining workflows from abstract expressions of future activity to specific execution plans.

• Supporting services for managing, querying, and editing both individual workflows and collections of workflows at all stages in their lifecycle.

• Techniques for managing access to collections of workflows and workflow products within multi-person and multi-institutional settings.

• Techniques for mining information about past activities, for example to provide input to the planning of future work.
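
The following is a minimal sketch of such a representation (our illustration, not the VDS data model): one record carries an activity through the lifecycle stages, so that "what I want to do", "what I am doing", and "what I did" are all queryable from the same structure. The stage names mirror the figure; the example activity is invented.

```python
# A minimal sketch of a lifecycle-aware workflow record (illustrative only).
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class Stage(Enum):
    NOT_YET_EXECUTABLE = 1   # abstract plan, resources unresolved
    EXECUTABLE = 2           # refined to a concrete execution plan
    EXECUTING = 3
    EXECUTED = 4

@dataclass
class Activity:
    name: str
    stage: Stage = Stage.NOT_YET_EXECUTABLE
    plan: Optional[str] = None                        # concrete plan, once refined
    history: List[str] = field(default_factory=list)  # versioning / provenance trail

    def refine(self, plan: str):
        self.plan, self.stage = plan, Stage.EXECUTABLE
        self.history.append(f"refined to: {plan}")

    def start(self):
        self.stage = Stage.EXECUTING
        self.history.append("started")

    def finish(self, result: str):
        self.stage = Stage.EXECUTED
        self.history.append(f"finished: {result}")

def query(activities: List[Activity], stage: Stage) -> List[str]:
    """e.g., 'what am I doing?' == query(acts, Stage.EXECUTING)"""
    return [a.name for a in activities if a.stage == stage]

a = Activity("mosaic-m31")
a.refine("run mosaic pipeline on site X")
a.start()
print(query([a], Stage.EXECUTING))   # -> ['mosaic-m31']
```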

Our work on the GriPhyN Virtual Data System (vds.isi.edu) is intended as a step in this direction, and has helped inform our views on these topics.

[Figure: workflow lifecycle. Activities progress from "not yet executable" to "executable", "executing", and "executed", corresponding to "What I Want to Do", "What I Am Doing", and "What I Did"; query, edit, and scheduling onto an execution environment connect the stages.]

23

Francisco Curbera

Flow composition and SOA: Perspectives and directions
Francisco Curbera, IBM T. J. Watson Research, [email protected]

1. Service Component Architecture

Service-oriented computing (SOC) and the service-oriented architecture (SOA) are fundamentally models for distributed software components. Full inter-component interoperability, based on Web services standards, is a core assumption of the SOC model. SOC, however, is not limited to a particular distributed computing stack (Web services), since the benefits of a distributed component model extend to legacy protocols and platforms as well. Web services have successfully stressed the notion that implementation characteristics should be decoupled from interoperability concerns, and have focused on defining an XML-based interoperability stack. SOC is directly concerned with the implementation and management of service-oriented applications and stresses the ability to incorporate multiple runtimes and programming models into an architecture of distributed software components.

The Service Component Architecture (SCA) [1] is the first realization of SOC as an explicit component model. Just as Web services provide the common abstraction of interoperability concerns, SCA provides a common abstraction of implementation concerns. SCA introduces a common notion of service components, service types and service implementations, as well as an assembly model for connecting service components into service-oriented applications. SCA's goal is to be able to accommodate multiple implementation platforms within a single set of component-oriented abstractions. J2EE, BPEL4WS, COBOL, SQL or XML components are only part of the possible implementation artifacts that SCA intends to support.
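
The assembly idea can be illustrated with a toy sketch (not SCA syntax or an SCA runtime): components expose named services and declare named references, and an assembly wires references to services regardless of how each component is implemented. The component and service names are invented.

```python
# Toy illustration of structural composition: components expose services,
# declare references, and an assembly wires them together independently of
# each component's implementation technology.
class Component:
    def __init__(self, name, implementation):
        self.name = name
        self.services = {}     # service name -> callable
        self.references = {}   # reference name -> (component, service) once wired
        self.implementation = implementation   # e.g. "BPEL", "Java", "SQL"

    def provide(self, service_name, fn):
        self.services[service_name] = fn

    def call(self, reference_name, *args):
        target, service = self.references[reference_name]
        return target.services[service](*args)

def wire(source, reference_name, target, service_name):
    """Structural composition: connect a reference to a service."""
    source.references[reference_name] = (target, service_name)

quote = Component("QuoteService", implementation="Java")
quote.provide("getQuote", lambda symbol: 42.0)

portfolio = Component("Portfolio", implementation="BPEL")
wire(portfolio, "quotes", quote, "getQuote")
print(portfolio.call("quotes", "IBM"))   # -> 42.0
```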

Composition is naturally the central development activity in any component model. In SOC it is found in two forms: process (or flow) oriented composition, such as that provided by the BPEL4WS language, and structural composition, as introduced by the SCA assembly model. The flow and assembly paradigms enable a very rich set of composition capabilities in SOC systems. They also open a number of interesting research challenges and opportunities.

2. Some Research Perspectives

Flow integration. In SOC, flows provide a primary composition model for services. When the structural composition pattern is used, however, flows themselves are the subject of composition. In these cases the flow execution logic is available (gray-box composition) and permits a level of analysis not possible in black-box composition scenarios. In particular, it is then possible to reason about the following aspects of the (structural) composition:

1. Analysis of behavioral compatibility between partner flows.
2. Correctness of the overall composition (deadlock freeness, soundness, etc.).
3. Code/flow generation based on compatibility requirements.
4. Consistency check between implementation and external specification.

Existing analysis techniques (Petri nets, for example) can be used for this purpose [2], but they are limited by the fact that the explicit control logic in a flow typically captures only part of the full execution logic (data relationships are often used to encode the rest). This approach, on the other hand, is not limited to services implemented as flows, but is applicable whenever service behavior

24

information in the form of "flow metadata" is available for a service (for example, in the case of state machines, rule-based or procedural programs, etc., from which a flow description can be derived or is provided by the service developer).
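
As a small illustration of analysis points 1 and 2 above, the following toy sketch explores the joint state space of two partner flows modeled as labeled transition systems that synchronize on messages, and reports states where the composition gets stuck. It only illustrates the kind of check meant here, not the Petri-net machinery of [2]; the buyer/seller flows are invented.

```python
# Toy compatibility/deadlock check for two partner flows: a send "!x" in one
# flow synchronizes with the matching receive "?x" in the other.
def compose_and_check(flow_a, flow_b, start_a, start_b, finals):
    """flows: {state: [(label, next_state), ...]}; finals: set of OK end pairs."""
    seen, stack, deadlocks = set(), [(start_a, start_b)], []
    while stack:
        a, b = stack.pop()
        if (a, b) in seen:
            continue
        seen.add((a, b))
        moves = []
        for la, na in flow_a.get(a, []):
            for lb, nb in flow_b.get(b, []):
                # a send in one flow meets the matching receive in the other
                if (la == "!" + lb[1:] and lb.startswith("?")) or \
                   (lb == "!" + la[1:] and la.startswith("?")):
                    moves.append((na, nb))
        if not moves and (a, b) not in finals:
            deadlocks.append((a, b))   # stuck in a non-final joint state
        stack.extend(moves)
    return deadlocks

buyer = {"s0": [("!order", "s1")], "s1": [("?invoice", "s2")]}
seller = {"t0": [("?order", "t1")], "t1": [("!receipt", "t2")]}  # sends receipt, not invoice
print(compose_and_check(buyer, seller, "s0", "t0", finals={("s2", "t2")}))
# -> [('s1', 't1')]: the composition deadlocks, i.e. the flows are incompatible
```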

Composition of non-functional properties. Both the flow and structural composition models are typically centered on composition of functional characteristics of services (interfaces, behavioral descriptions) but leave to a later phase the configuration and composition of non-functional properties and requirements (NFRs). Typically, this results in an inefficient and long configuration cycle mostly based on trial and error. There is a clear need (and opportunity) for tools capable of:

1. Validating the NFR configuration of composite applications.
2. Supporting top-down NFR configuration based on end-to-end specifications.
3. Reasoning about and providing early feedback on emergent properties of composites (bottom-up composition of NFRs).

The complexity of these problems is compounded by the differences among NFR domains (security versus availability versus performance…), and by the existence of complex cross-domain interactions. On the other hand, explicit composition languages provide end-to-end visibility of the application structure and thus the appropriate basis for reasoning about end-to-end NFR capabilities.

Internet-scale composition. Flows have traditionally been associated with one or another middleware technology (message-oriented middleware, distributed object models or Web services). A model for Internet-scale flows, fully compatible with Web technologies (REST), is not yet available. Part of the reason is the focus that the Web architecture [3] places on data (resources) and how much focus it still places on human-to-computer interactions. On the other hand, flow technologies such as BPEL consider the Web an interoperability medium, rather than a natural execution environment. A flow execution model natively built on REST principles should be able to take full advantage of the Web's scalability properties to deliver global-scale, process-oriented composition.

Similarly, loosely coupled media such as the Web or the enterprise service bus (ESB) have traditionally been used to support component interaction but not the remote assembly of components into complex solutions. Services typically "invoke" other services across the Web, but remote services typically cannot be "wired" and "configured" in the same way local code or local services are. Assembly (structural composition) of services is typically performed "by value", not "by reference". To support assembly by reference, remote services need to be exposed as configurable (SCA) components rather than just as interfaces that can be invoked. The result would be to expand the reach of the structural composition model to meet the scale of the Web, and to provide a level of service reuse far beyond the current Web service invocation model.

3. References

[1] Service Component Architecture, http://www-128.ibm.com/developerworks/library/specification/ws-sca/
[2] Axel Martens: Analyzing Web Service Based Business Processes. FASE 2005: 19-33
[3] Architecture of the World Wide Web, Volume One, http://www.w3.org/TR/webarch/

25

Thomas Fahringer

ASKALON: A Grid Application Development and Runtime Environment www.askalon.org

Research Statement by Thomas Fahringer, University of Innsbruck, Austria [email protected]

The Askalon research group is currently working on a high-level paradigm for Grid workflow applications. The user should be shielded from the complexity, details and implementation technology (including Web/Grid services) of the Grid when creating Grid workflow applications. Our approach is based on AGWL (the Abstract Grid Workflow Language), which covers important control and data flow constructs. Constraints and properties as part of AGWL are used as an interface to the underlying runtime environment to fulfill QoS requirements. We also created a UML interface (Teuta) for high-level graphical specification of Grid workflows. Askalon also comes with a runtime environment that is based on GT4 services, covering workflow scheduling based on the HEFT algorithm, advance reservation, resource management, performance analysis, monitoring and prediction services. Resource management is an essential part of bridging the gap between high-level workflow activity specifications and deployed software components running on Grid computers. We have developed the GLARE service for semi-automatic deployment and on-demand provisioning of application components that can be used to dynamically build Grid applications. Performance-oriented Grid workflows require advanced support through performance prediction and dynamic performance analysis. We have developed a systematic approach to dynamic performance analysis for Grid workflows that is based on a hierarchy of performance overheads, with which we try to explain the difference between ideal and actual workflow execution time. The Askalon scheduler incorporates these services to dynamically decide on the mapping of workflow components onto Grid sites. The Askalon group is also involved in two EU-funded projects (K-WF Grid and ASG, Adaptive Services Grid) that work towards semantic description of Grid applications with the goal of automatically generating Grid workflows.
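
For readers unfamiliar with HEFT, it orders tasks by their "upward rank" (a task's average cost plus the longest remaining path to the workflow exit) and then greedily maps them in that order. A minimal sketch of the ranking step follows; it is our illustration, not ASKALON code, and the example DAG and costs are invented.

```python
# A minimal sketch of HEFT's ranking step:
#   rank_u(t) = avg_cost(t) + max over successors s of (avg_comm(t, s) + rank_u(s))
# Tasks are then scheduled in decreasing upward rank.
from functools import lru_cache

def heft_order(succ, avg_cost, avg_comm):
    """succ: {task: [successors]}; avg_cost: {task: mean compute time};
    avg_comm: {(t, s): mean transfer time between t and s}."""
    @lru_cache(maxsize=None)
    def rank_u(t):
        return avg_cost[t] + max(
            (avg_comm.get((t, s), 0) + rank_u(s) for s in succ.get(t, [])),
            default=0)
    return sorted(avg_cost, key=rank_u, reverse=True)

succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
avg_cost = {"A": 3, "B": 5, "C": 2, "D": 4}
avg_comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 2, ("C", "D"): 1}
print(heft_order(succ, avg_cost, avg_comm))   # -> ['A', 'B', 'C', 'D']
```

Average (rather than per-resource) costs are used in the ranking so that the task order is independent of any particular mapping; the subsequent greedy mapping step then uses resource-specific estimates.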

We are currently looking into the following research problems. A main objective is scalable and fault-tolerant Grid middleware. We are particularly concerned about the performance of base technologies such as Web services, SOAP, etc. The overhead of important middleware services, job submission systems and security features is also often substantial, which leaves much room for optimization. Currently, Askalon is mostly based on performance QoS parameters such as execution times, bandwidth, latencies, response times, etc. We are working on an extension of our scheduling/optimization services to become more generic for a wider range of QoS parameters, also covering costs, reliability, security, and other parameters. Moreover, we are interested in the next class of challenging Grid applications going beyond static workflows by targeting application classes with dynamic control and data flow interaction among application components. Semantic technology must be further explored to elaborate to what extent it can be used to make Grid middleware intelligent and to simplify the creation of Grid applications. Besides scientific workflows, we are increasingly interested in commercial and educational applications from the areas of online gaming and e-learning. Clearly, business models for the Grid, tradeoffs between security and performance, and self-managed, scalable Grid middleware are other interesting topics that require substantial research now and in the future.

26

Carl Kesselman

See Deelman et al. statement:

"Community Process as Workflow"
Ewa Deelman, Ian Foster, Carl Kesselman, Mike Wilde

27

Gregor von Laszewski

The Java CoG Kit Workflow Framework
Contact: Gregor von Laszewski, [email protected], http://www.cogkit.org
Scientist, Argonne National Laboratory and Fellow, University of Chicago

An up-to-date version of this document is available at:
* http://wiki.cogkit.org/index.php/NSF_Workshop_Java_CoG_Kit_Workflow_Framework

The Java CoG Kit provides a number of workflow-related features that allow us to easily use Grid and non-Grid infrastructure. These are primarily of interest to scientific user communities. Our goal is to design a set of tools that allow the creation, deployment, and instantiation of dynamically adapting workflows based on conditions defined by the users and the available infrastructure. To achieve this goal we have defined a number of elementary tools and concepts that help us make progress towards designing such an environment.

One of the concepts that we introduced with the Java CoG Kit is the availability of well-accepted Grid patterns exposed through an abstraction layer that hides much of the detail of the underlying Grid infrastructure. Such an abstraction layer can, for example, be useful to protect users from upgrades to the backend Grid infrastructure in case the workflows run for months and not just minutes or hours. While using supercomputing centers and large-scale Grid infrastructures such as the TeraGrid, with its extensive software stack, we must be aware that future scientific applications should be immune to changes in the software stack during their lifetime.

In addition, we need the ability to formulate workflows with a simple, extensible workflow language that allows the creation of scalable workflow engines. The language must be simple, as a variety of specialized workflow engines that focus on particular performance goals set by the user community must be available. We also believe it must be able to specify more than just DAGs. Although we see standardization efforts such as BPEL, we must realize that implementation of the full standard is challenging and unnecessary for many applications. We also observe that some of the BPEL workflow engines may not be as scalable as some applications may desire. We must analyze which features we need to find a balance between features, performance, and completeness. Besides the availability of a workflow language, we must be able to extend the language through a set of core libraries that target the dynamic nature of the workflows. This includes fault tolerance and checkpointing, but also dynamically created forms that allow us to control the workflow in an interactive mode. Furthermore, we have designed a workflow service that allows the scheduling of system tasks with a performance improvement of one order of magnitude over what a typical Grid execution service is capable of. This much-increased performance allows the integration of finer-grained scientific applications and will provide the ability to delegate workflows in a future release.
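
The abstraction-layer idea can be pictured with a small, hypothetical sketch (it is not the Java CoG Kit API): the workflow submits tasks against a provider-independent interface, and the binding to a particular backend can change without touching the workflow. All class, method, and host names below are invented.

```python
# Hypothetical sketch of a provider-independent task abstraction: the workflow
# talks only to TaskHandler, so the concrete provider can be swapped when the
# backend Grid infrastructure is upgraded.
from abc import ABC, abstractmethod

class TaskHandler(ABC):
    @abstractmethod
    def submit(self, executable: str, args: list) -> str: ...

class LocalProvider(TaskHandler):
    def submit(self, executable, args):
        return f"local: ran {executable} {' '.join(args)}"

class GridProvider(TaskHandler):
    def __init__(self, contact):
        self.contact = contact   # e.g. a job-manager contact string
    def submit(self, executable, args):
        return f"grid[{self.contact}]: submitted {executable} {' '.join(args)}"

def run_workflow(handler: TaskHandler):
    # the workflow never names a Grid technology directly
    print(handler.submit("transform", ["input.dat"]))
    print(handler.submit("render", ["output.dat"]))

run_workflow(LocalProvider())                       # test locally
run_workflow(GridProvider("tg-login.example.org"))  # same workflow on a Grid
```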

References

A number of publications can be found at the following Web site:
* http://wiki.mcs.anl.gov/gregor/index.php/Gregor_von_Laszewski
In particular:
* http://wiki.mcs.anl.gov/gregor/index.php/Gregor_von_Laszewski#las06work
* http://wiki.mcs.anl.gov/gregor/index.php/Gregor_von_Laszewski#las06karajan
* http://wiki.mcs.anl.gov/gregor/index.php/Gregor_von_Laszewski#las05workflowrepo
* http://wiki.mcs.anl.gov/gregor/index.php/Gregor_von_Laszewski#las06water
General information about the Java CoG Kit can be found at:
* http://wiki.cogkit.org/index.php/Java_CoG_Kit_Documentation