CaBIG Workflow University of Chicago, USA University of Manchester, UK.


caBIG Workflow

University of Chicago, USA
University of Manchester, UK

Agenda

• caBIG Workflows: The “BIG” picture
• caBIG Workflow Infrastructure
• Semantic service discovery
• Composing the workflow
• Invoking stateful and secure services
• Workflow execution service
• Discovering and executing caBIG workflows using the caGrid portal
• Examples of caBIG workflows
• Future directions

The caGrid ecosystem and the role of workflow

[Figure: caGrid over data, instruments and computation resources. Labels: virtualization, security, connectivity, Cancer Data Standards Repository; discovery, composition, orchestration, reuse, community.]

Scientific workflow lifecycle

[Figure: lifecycle loop with “generate” and “reuse” arrows]

• Workflow as consumer
  • Easily reuse services for complex experiments.
• Workflow as contributor
  • Workflow as “best practice” wrapped as services.

The caBIG Workflow System

[Figure: caGrid and the Cancer Data Standards Repository supporting the workflow lifecycle: discovery, composition, execution, reuse, community, with “generate” and “reuse” arrows. Annotations:]

• Service discovery based on cancer research metadata
• Data-flow modeling flavor
• caGrid activity
• State management (WSRF)
• Security (GSI)
• Implicit iteration: handles parallel execution
• WSRF and GSI enforcement
• A “Facebook” for caGrid workflows
• Workflow Execution Service
• Workflows in caGrid Portal
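The “implicit iteration” annotation can be illustrated with a minimal sketch: when a step that expects a single value receives a list, the step is invoked once per element and the results are collected in order. The function name `invoke_with_implicit_iteration` is hypothetical, and the thread pool is only one assumed way to run the calls in parallel; this is not the Taverna engine's code.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke_with_implicit_iteration(service, value):
    """Call `service` once for a single value, or once per element of a list.

    List elements are processed in parallel; pool.map keeps the results in
    the same order as the inputs.
    """
    if isinstance(value, list):
        with ThreadPoolExecutor() as pool:
            return list(pool.map(service, value))
    return service(value)


# A step written for one string works unchanged on a list of strings.
single = invoke_with_implicit_iteration(str.upper, "exp-1")
many = invoke_with_implicit_iteration(str.upper, ["exp-1", "exp-2"])
```

The workflow author wires a list into the step and never writes the loop explicitly.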

Lymphoma Prediction Workflow

• Scientific value
  • Use gene-expression patterns associated with two lymphoma types to predict the type of an unknown sample.
  • Connect the caGrid data service (caArray) with analytical services (PreProcess, SVM and KNN from GenePattern).
• Major steps
  • Querying training data from experiments stored in caArray.
  • Preprocessing, i.e., normalizing the microarray data.
  • Predicting lymphoma type using SVM & KNN services.
• Extension
  • Generalized the workflow into a cancer type prediction routine that can be used on other caArray data sets.
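The three major steps can be sketched in plain Python. This is an illustration under stated assumptions only: the toy data and the functions `fetch_training_data`, `normalize` and `knn_predict` are hypothetical stand-ins, not the caArray or GenePattern APIs, and only the KNN half of the prediction step is shown.

```python
import math
from collections import Counter


def fetch_training_data():
    """Stand-in for the caArray query: (expression_vector, label) pairs."""
    # Toy values invented for illustration; real data would come from caArray.
    return [
        ([1.0, 2.0, 9.0], "type_A"),
        ([2.0, 1.0, 8.0], "type_A"),
        ([9.0, 8.0, 1.0], "type_B"),
        ([8.0, 9.0, 2.0], "type_B"),
    ]


def normalize(vector):
    """Stand-in for preprocessing: scale to zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    std = math.sqrt(sum((x - mean) ** 2 for x in vector) / len(vector)) or 1.0
    return [(x - mean) / std for x in vector]


def knn_predict(training, sample, k=3):
    """Label `sample` by majority vote among its k nearest training vectors."""
    nearest = sorted(training, key=lambda pair: math.dist(pair[0], sample))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def predict_lymphoma_type(sample, k=3):
    """Query, preprocess, then predict, mirroring the three major steps."""
    training = [(normalize(v), label) for v, label in fetch_training_data()]
    return knn_predict(training, normalize(sample), k)
```

In the real workflow each function would be a separate caGrid service call, with the data passed between them by the workflow engine.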

[Figure: microarray from tumor tissue, microarray preprocessing, lymphoma prediction. Fig. from MA Shipp. Nature Medicine, 2002]

Lymphoma Prediction Workflow

[Figure: lymphoma type prediction workflow]

Acknowledgement: Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI), Jared Nedzel (MIT)

caGrid Workflow Infrastructure

Semantic based service query

Workflow composition

Configuring Taverna

• 2 default caGrid configurations in Taverna:
  • NCI Production caGrid v1.3
  • Training caGrid
• Configuration: a set of caGrid services belonging to the same grid
• Other “caGrids” can be defined through preferences

Semantic Service Discovery

• Semantic search queries the Index Service for registered caGrid services matching various search criteria:
  • Service name, inputs, outputs, research center, class names, concept codes, etc.
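The matching idea can be sketched as a filter over registered service metadata. Everything here is an assumption made for illustration: `ServiceRecord`, its fields, the sample records and the placeholder concept codes do not reflect the Index Service's actual data model or query API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRecord:
    """Hypothetical, simplified view of the metadata a service registers."""
    name: str
    research_center: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    concept_codes: list = field(default_factory=list)


def search(index, **criteria):
    """Return the records that satisfy every given criterion.

    Text fields match on a case-insensitive substring; list fields match
    when the wanted value is one of the items.
    """
    def matches(record, key, wanted):
        value = getattr(record, key)
        if isinstance(value, list):
            return wanted in value
        return wanted.lower() in value.lower()

    return [r for r in index if all(matches(r, k, w) for k, w in criteria.items())]


# Invented sample registrations; names and codes are placeholders.
index = [
    ServiceRecord("ArrayDataService", "Center A", outputs=["Experiment"], concept_codes=["CODE-1"]),
    ServiceRecord("PreProcessService", "Center B", inputs=["Experiment"], outputs=["NormalizedData"]),
    ServiceRecord("ClusteringService", "Center A", inputs=["NormalizedData"], concept_codes=["CODE-2"]),
]

hits = search(index, research_center="center a", inputs="NormalizedData")
```

Combining criteria narrows the result, which is how a search by research center plus input type would isolate a single analytical service.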

Adding caGrid services directly

• If the user knows the WSDL URL of a caGrid service, the service can be added directly

caBIG services palette

• As a result of semantic search or direct adding:
  • caBIG services appear in Taverna’s Service Panel
  • Ready to be dragged and dropped into caGrid workflows

Stateful caGrid services

• Taverna provides support for stateful caGrid services that implement the WSRF spec.
• Taverna can detect whether a service is WSRF-compliant and adds a special input port, ‘EndpointReference’, to it
• The EPR can be passed around the workflow as a normal parameter
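A toy model of this pattern: one step creates a resource and receives an EPR, and later steps take that EPR as an ordinary input to reach the same state. The classes `EndpointReference` and `CounterService` and the example.org address are hypothetical; real EPRs are WS-Addressing XML structures, not Python objects.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointReference:
    """Simplified EPR: a service address plus a key naming one resource."""
    address: str
    resource_key: str


class CounterService:
    """Toy stateful service; each resource keeps its own running total."""

    def __init__(self, address):
        self.address = address
        self._totals = {}
        self._ids = itertools.count(1)

    def create_resource(self):
        epr = EndpointReference(self.address, f"res-{next(self._ids)}")
        self._totals[epr] = 0
        return epr

    def add(self, epr, amount):
        """Operate on the resource that `epr` points to."""
        self._totals[epr] += amount
        return self._totals[epr]


service = CounterService("https://example.org/wsrf/services/Counter")
first = service.create_resource()    # step 1 outputs an EPR
second = service.create_resource()
service.add(first, 5)                # step 2 receives the EPR as an input
total = service.add(first, 2)        # step 3 reuses the same EPR
other = service.add(second, 1)       # a different EPR reaches different state
```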

Secure caGrid services

• Taverna can invoke secure caGrid services that require the user to log in to caGrid
• Taverna interacts with caGrid’s GAARDS infrastructure to obtain the user’s proxy:
  • Authenticate the user with the user’s affiliated Authentication Service
  • Obtain the user’s proxy from the Dorian Service
  • Default proxy lifetime: 12 hours

Using secure caGrid services

• Involves:
  1. Configuring a secure caGrid service from Taverna
  2. Logging onto the selected caGrid to obtain a proxy certificate
  3. Saving and managing caGrid proxies, usernames and passwords

Configuring secure services (1/2)

• The Authentication Service and Dorian Service URLs are required to obtain the user’s proxy
• They can be configured globally for all services from the same caGrid (in preferences)
• They can be configured individually for a particular caGrid service (this overrides the configuration from preferences)
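The override rule amounts to a two-level lookup: start from the grid-wide preferences, then apply any service-specific settings on top. The function name, dictionary keys and example.org URLs below are assumptions for illustration, not Taverna's actual preference format.

```python
def resolve_security_config(service_url, grid_defaults, service_overrides):
    """Merge grid-wide defaults with per-service settings; the latter win."""
    config = dict(grid_defaults)
    config.update(service_overrides.get(service_url, {}))
    return config


# Placeholder URLs; real values come from the grid's deployment.
grid_defaults = {
    "authentication_service_url": "https://auth.example.org/AuthenticationService",
    "dorian_service_url": "https://dorian.example.org/Dorian",
}
service_overrides = {
    "https://svc.example.org/SecureService": {
        "dorian_service_url": "https://other-dorian.example.org/Dorian",
    },
}

overridden = resolve_security_config(
    "https://svc.example.org/SecureService", grid_defaults, service_overrides
)
inherited = resolve_security_config(
    "https://svc.example.org/AnotherService", grid_defaults, service_overrides
)
```

A service with no entry of its own simply inherits both URLs from the grid preferences.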

Configuring secure services (2/2)

• View a secure service’s details
• Configure the service’s security properties

Logging onto caGrid

• The user is prompted for their caGrid username and password the first time any secure service is invoked from a workflow

Credential management (1/2)

• Taverna obtains a proxy for the user from the Dorian Service using the user’s caGrid username and password
• Proxies are saved and managed by the Credential Manager
• The caGrid username and password can also be remembered
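The caching behaviour described here can be sketched as follows: a proxy is fetched on first use, reused while it is still valid, and fetched again once its lifetime has passed. The class and its `login` callback are hypothetical stand-ins for the Authentication Service and Dorian exchange; only the 12-hour default comes from the slides.

```python
import time
from dataclasses import dataclass

DEFAULT_PROXY_LIFETIME_S = 12 * 60 * 60  # 12-hour default proxy lifetime


@dataclass
class Proxy:
    subject: str
    expires_at: float


class CredentialManager:
    """Caches one proxy per grid and renews it after expiry."""

    def __init__(self, login, clock=time.time):
        self._login = login    # callable(grid_name) -> subject; assumed interface
        self._clock = clock    # injectable so expiry can be simulated
        self._proxies = {}

    def get_proxy(self, grid_name):
        now = self._clock()
        proxy = self._proxies.get(grid_name)
        if proxy is None or now >= proxy.expires_at:
            # First use or expired: log in again and store the new proxy.
            subject = self._login(grid_name)
            proxy = Proxy(subject, now + DEFAULT_PROXY_LIFETIME_S)
            self._proxies[grid_name] = proxy
        return proxy
```

Passing the clock in as a parameter keeps the expiry logic checkable without waiting 12 hours.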

Workflow execution service

Taverna Workflow Service wraps the Taverna execution engine into a WS-Resource and exposes operations such as createResource, startWorkflow, getStatus, and getOutput for user submitted workflows.

[Figure: the Client API, Taverna Workbench and Workflow Portlet call the Workflow Service operations (createResource, startWorkflow, getStatus, getOutput). Stateful resources (resource properties) are addressed by EPR and run on the Taverna Engine, which invokes data services, analytical services, and caGrid and other services.]

Workflow execution service

The Taverna Workflow Service:

• Provides stateful resources that execute the workflows.
• Supports the caGrid security architecture (GSI Security).
• Allows programmatic submission of workflows.
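Programmatic submission can be pictured with an in-memory stand-in that reuses the four operation names from the slides. The argument lists, the status strings and the use of a plain string as the EPR are assumptions; the real service is a WSRF web service, not a Python class.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class WorkflowService:
    """In-memory mock exposing the four operation names from the slides."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._resources = {}

    def createResource(self, workflow):
        epr = str(uuid.uuid4())  # stands in for a real EndpointReference
        self._resources[epr] = {"workflow": workflow, "future": None}
        return epr

    def startWorkflow(self, epr, inputs):
        resource = self._resources[epr]
        # Run in the background so the caller can poll instead of blocking.
        resource["future"] = self._pool.submit(resource["workflow"], **inputs)

    def getStatus(self, epr):
        future = self._resources[epr]["future"]
        if future is None:
            return "Pending"
        if not future.done():
            return "Active"
        return "Failed" if future.exception() else "Completed"

    def getOutput(self, epr):
        return self._resources[epr]["future"].result()  # waits until finished


service = WorkflowService()
epr = service.createResource(lambda experiment_id: f"prediction for {experiment_id}")
status_before = service.getStatus(epr)
service.startWorkflow(epr, {"experiment_id": "EXP-1"})
output = service.getOutput(epr)
status_after = service.getStatus(epr)
```

Because each submission gets its own resource, a client can keep the EPR and come back for the status or output later.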

Access Taverna workflow via caGrid portal

The Taverna Workflow Portlet is deployed in the caGrid Portal on the training grid:

URL: http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow

• The portlet currently lists a few workflows, with their descriptions, that can be browsed from the above URL.

• Users can select a workflow they are interested in running.

View : 1

Access Taverna workflow via caGrid portal

URL : http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow

• Based on the number of input ports in the workflow, the portlet prompts the user to enter the input values in text boxes.
• For example, the Lymphoma workflow takes only one input, an Experiment ID that identifies the experiment that caArray uses for data collection.
• Click Submit after entering the data.

View : 2

Access Taverna workflow via caGrid portal

URL : http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow

• The portlet stores the user-submitted workflows in the current session of the portal.
• Users can view all the active and completed workflows in the session.
• Clicking the Output button shows the output of the workflow.
• The portlet provides workflow-specific view resolvers to render the outputs. For example, the Lymphoma workflow currently displays its output in an HTML table.
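The view-resolver idea can be sketched as a lookup from workflow name to rendering function, with a generic fallback. The registry, function names and row format are hypothetical and say nothing about how the portlet is actually implemented.

```python
from html import escape


def table_view(rows):
    """Render a list of dict rows as a minimal HTML table."""
    if not rows:
        return "<table></table>"
    headers = list(rows[0])
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[h]))}</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def plain_view(output):
    """Fallback: show the raw output as preformatted text."""
    return f"<pre>{escape(str(output))}</pre>"


# Workflow-specific resolvers; anything unlisted uses the fallback.
VIEW_RESOLVERS = {"lymphoma": table_view}


def render(workflow_name, output):
    return VIEW_RESOLVERS.get(workflow_name, plain_view)(output)
```

Supporting a new workflow's output format then means registering one more function rather than changing the portlet's display code.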

Views : 3, 4, & 5

Ack. Manav Kher, Joshua Phillips (SemanticBits)

Workflow execution service plug-in

• Submit the workflow to an execution service.

• Retrieve execution result asynchronously.

Examples of caBIG workflows: caDSR

• Scientific value
  • To find all the UML packages related to a given context (‘caCore’).
  • Not a real scientific experiment.
  • Simple.
  • Important in caGrid.
• Steps
  • Querying the Project object.
  • Performing data transformation.
  • Querying the Packages object and getting the result.
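The three steps form a query, transform, query chain in which the middle step is a small adapter that reshapes one service's output into the next service's input. The in-memory tables, field names and functions below are invented for illustration and are not the caDSR data model or API.

```python
# Toy stand-ins for the two caDSR object collections.
PROJECTS = [
    {"id": "P1", "context": "caCore"},
    {"id": "P2", "context": "Other"},
    {"id": "P3", "context": "caCore"},
]
PACKAGES = [
    {"project_id": "P1", "name": "pkg.alpha"},
    {"project_id": "P2", "name": "pkg.beta"},
    {"project_id": "P3", "name": "pkg.gamma"},
]


def query_projects(context):
    """Step 1: find the projects registered under a context."""
    return [p for p in PROJECTS if p["context"] == context]


def extract_project_ids(projects):
    """Step 2 (data transformation): keep only what the next query needs."""
    return [p["id"] for p in projects]


def query_packages(project_ids):
    """Step 3: find the packages belonging to the given projects."""
    wanted = set(project_ids)
    return [pkg["name"] for pkg in PACKAGES if pkg["project_id"] in wanted]


def find_packages(context):
    return query_packages(extract_project_ids(query_projects(context)))
```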

[Figure legend: workflow input, caGrid services, “shim” services, workflow output]

Protein sequence information query

• Scientific value
  • To query protein sequence information from 3 caGrid data services: caBIO, CPAS and GridPIR.
  • To analyze a protein sequence from different data sources.
• Steps
  • Querying CPAS to get the id, name and value of the sequence.
  • Querying caBIO and GridPIR using the id or name obtained from CPAS.

Microarray clustering*

• Scientific value
  • A common routine to group genes or experiments into clusters with similar profiles.
  • To identify functional groups of genes.
• Steps
  • Querying and retrieving the microarray data of interest from a caArrayScrub data service at Columbia University
  • Preprocessing (normalizing) the microarray data using the GenePattern analytical service at the Broad Institute at MIT
  • Running hierarchical clustering using the geWorkbench analytical service at Columbia University
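For readers unfamiliar with the final step, hierarchical (agglomerative) clustering can be sketched in a few lines: start with every profile in its own cluster and repeatedly merge the two closest clusters. The single-linkage rule and the stopping criterion used here are assumptions chosen for brevity; this is not geWorkbench's implementation.

```python
import math


def hierarchical_clusters(points, n_clusters):
    """Merge the closest clusters until `n_clusters` remain (n_clusters >= 1).

    Returns clusters as lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Four toy expression profiles forming two obvious groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
groups = hierarchical_clusters(profiles, 2)
```

Recording the order of merges instead of stopping at a fixed count would give the full dendrogram that clustering tools usually display.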

[Figure legend: workflow in/output, caGrid services, “shim” services, others]

*Wei Tan, Ravi Madduri, Kiran Keshav, Baris E. Suzek, Scott Oster, Ian Foster. Orchestrating caGrid Services in Taverna. ICWS 08.

[Figures: execution trace; execution result as XML (1936 gene expressions)]

caGrid workflows in myExperiment

• caGrid workflows covered
  • Data service workflows
    • caDSR query
    • Protein sequence query
  • Data + analytical service workflows
    • Microarray clustering
    • Lymphoma type classification
• caGrid workflows are uploaded to myExperiment and accessible from: http://www.myexperiment.org/workflows/search?query=cabig

Future Directions

• More guidance in workflow modeling

• Leverage caDSR, EVS and the workflows at myExperiment

• More friendly user interface

• A CQL builder for caGrid data services

• More shim services for data transformation

• More features

• Integration with caGrid transfer to access data

• Browsing and executing workflows from caGrid portal

• Enhanced security support

• More workflows of real scientific value

More information

• caGrid workflow

• http://cagrid.org/display/workflow/Home

• Our team
  • Univ. Chicago: Wei Tan, Dinanath Sulakhe, Ravi Madduri
  • Univ. Manchester, UK: Carole Goble, Stian Soiland-Reyes, Alexandra Nenadic