Workflow management: motivation and vision. Ela Hunt, Ela.Hunt@SystemsX.ch

Post on 28-Dec-2015


Workflow management: motivation and vision

Ela Hunt, Ela.Hunt@SystemsX.ch

Ela Hunt, SyBIT

Plan

Overview of existing workflows

Gains to be achieved via workflows

Methodological assumptions: how to support and construct workflows with less effort and more effectively


Three areas of workflow use:

Deep sequencing

High content screening

Proteomics

Future: workflows combining those three methodologies, possibly including metabolomics, NMR, etc.


Deep sequencing

Management of reads (images) coming off the microscopy devices

Processing of images into sequence files

Alignment to a genome or genome assembly from short reads

Annotation with data from external sources

Candidate gene/drug target identification


Deep sequencing workflow status (Lausanne)

[Diagram: workflow components]

1a. Web – sample metadata capture

1b. Illumina sequencing

2. File server

3. Web-browse

4. Submit analysis pipeline

6. DAS server

7. MicrobeBrowser

8. AssociationViewer

Other diagram labels: Perl; Sequence analysis; Meta-data; Sequence data; Possible extensions


Deep sequencing workflow status

Lausanne – alignment via Eland (Emmanuel Beaudoing, Sylvain Pradervand)

Basel – under construction (Manuel Kohler)

Zurich – FGCZ – under construction (Remy Bruggmann)


Proteomics workflows

MS spectra

Mapping to proteins (merging output from various analysis programs)

Annotation with additional data

ETHZ – Perl scripts and KNIME (Andreas Quandt)

Lausanne, Geneva, Basel (?)


ETHZ proteomics example (drawn in KNIME by Andreas Quandt)


Screening workflows

Microscopy, image transfer, compression

Matlab scripts (light intensity adjustment, feature recognition, etc., leading to the identification of features) that write feature counts to a DB or to files

Stats and chart generation, sometimes including a user interface showing images (also for training), using KNIME, R, Matlab, etc.


Screening workflows

Lausanne – Petr Strnad's workflows in KNIME, Matlab, MySQL

iBRAIN, developed by Berend Snijder – an end-to-end solution with a GUI (shell script, XML, XSLT, HTML)

ImageJ in S. Maerkl's lab in Lausanne, which needs more automation and a DB

HCDC (PostgreSQL, Matlab, KNIME)


Lausanne workflow fragment

Loop for every plate…

Read available plates

…read cell data for the plate in the loop

Calculate the number of centrosomes for 7 different thresholds
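
The workflow fragment above can be read as a nested loop: over plates, then over thresholds. The following is only a minimal sketch of that reading in Python, not the KNIME workflow itself. The data layout (one list of spot intensities per cell), the function names and the seven threshold values are all assumptions made for illustration; the slide does not give them.

```python
from collections import defaultdict

# Hypothetical per-cell records: (plate_id, spot_intensities), where each
# intensity stands for one candidate centrosome spot detected in that cell.
CELLS = [
    ("plate_A", [0.9, 0.7]),
    ("plate_A", [0.4]),
    ("plate_B", [0.8, 0.6, 0.2]),
]

# Seven illustrative thresholds; the real values are not stated in the slides.
THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]


def read_available_plates(cells):
    """Return the sorted plate identifiers present in the data."""
    return sorted({plate for plate, _ in cells})


def read_cell_data(cells, plate):
    """Return the spot-intensity lists of all cells on one plate."""
    return [spots for p, spots in cells if p == plate]


def count_centrosomes(spots, threshold):
    """Count the spots whose intensity reaches the threshold."""
    return sum(1 for s in spots if s >= threshold)


def centrosome_counts(cells, thresholds):
    """For every plate and threshold, list the centrosome count of each cell."""
    result = defaultdict(dict)
    for plate in read_available_plates(cells):        # loop for every plate
        plate_cells = read_cell_data(cells, plate)    # read cell data for it
        for t in thresholds:                          # one pass per threshold
            result[plate][t] = [count_centrosomes(s, t) for s in plate_cells]
    return dict(result)


if __name__ == "__main__":
    for plate, by_threshold in centrosome_counts(CELLS, THRESHOLDS).items():
        print(plate, by_threshold)
```

Keeping the plate reader, the cell reader and the counting step as separate functions mirrors the separate nodes in the diagram, so each step could be swapped for a database query or a Matlab call without touching the loop.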


iBRAIN overview

Purpose: plates, wells, images => compress images, classify cells into types, count cells of various types, graph

Submit project via drag-and-drop of a file

Monitor progress on cluster via HTML pages

Technology: bash, Matlab, cluster, XML, HTML; web pages are generated from a bash script; paths and file names are embedded in the code


iBRAIN use cases


OUR GOALS: addressing technical challenges

Maintainability (extensibility) of the entire workflow

Portability

Automation (end-to-end execution)

Cost savings via code base sharing

Various architectures (storage, clusters)

Multiple logins (security, ease of administration)

Privacy

Most of these can be addressed by extending KNIME (next talk)


Extending KNIME: see workflows wiki page


What is KNIME?

A Java workflow management system

Integrates Python, R, Perl, Java snippets, JDBC

GUI – can be used by a bioinformatician

Also server and cluster products (Sun Grid Engine)

Used at several locations (the example below is P. Strnad's, from Lausanne)

KNIME Analysis (from P. Strnad)

[Plot: GFP-Centrin expression threshold; axis label: Percentage of cells below threshold]

50% of cells have 2 centrosomes

Usually exclude 10% of cells with low GFP-Centrin signal

KNIME Analysis

[Plot: axis labels: Centrosome number, Cell count]

Image Regions Viewer


Goals of KNIME extension

Maintainability (extensibility)

Portability

Automation (end-to-end execution)

Cost savings via code base sharing

Various architectures (storage, clusters)

Replacing multiple logins, or no logins at all, with a single login (security, ease of administration, privacy)


Security

Security – one username/password per user; one login that carries out the whole workflow

Will include cluster/db logins

KNIME – needs the concepts of user/session, login, accounting of who did what

Allows for workflow tracking, scientific repeatability, accounting


Distributed data and computation

Data Mover as a KNIME node: expose its input parameters, input and output as KNIME ports (KNIME abstracts over all of these and calls them ports)

Usage of clusters (LSF and others, as needed) – probably involving the spawning of several Java workflows distributed over a cluster, also reporting of status as jobs are being processed


Language additions

Wrapping for Matlab

Improved wrapping of Perl

Better facilities for R embedding (viewports)

CP2 embedding

Sequence: Eland, MAQ, Bowtie, BWA

Proteomics: Mascot, X!Tandem, OMSSA, SpectraST


GUI additions

Job submission GUI

Job monitoring GUI (to show errors in a manner appropriate for a biological user)

Workflow sharing GUI (choose workflow, associate with data)

GUI embedding facility for Java GUIs (the current implementation is too fiddly)


Workflow portability

A reconfiguration tool, based on the XML workflow description format supported by KNIME, written in XPath or XQuery (GUI?):

select all data paths and change them

select all software paths and change them

select db/login/cluster user data, update

check the updated values by testing all new parameters, report

for two identical workflow instances, report the config differences
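
Two of the reconfiguration steps listed above, rewriting paths and reporting configuration differences, can be sketched with the limited XPath support in Python's standard library. This is only an illustration under stated assumptions: the XML layout below (`workflow`, `node` and `entry` elements with `key` and `value` attributes) is a simplified, hypothetical stand-in and not the actual KNIME workflow format, and the function names are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified workflow description; real KNIME files differ.
WORKFLOW_XML = """
<workflow>
  <node name="reader">
    <entry key="data_path" value="/old/data/plates"/>
    <entry key="tool_path" value="/old/bin/matlab"/>
  </node>
  <node name="db">
    <entry key="db_user" value="alice"/>
  </node>
</workflow>
"""


def rewrite_values(root, key, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix in every entry that has this key."""
    changed = 0
    for entry in root.findall(f".//entry[@key='{key}']"):  # XPath selection
        value = entry.get("value", "")
        if value.startswith(old_prefix):
            entry.set("value", new_prefix + value[len(old_prefix):])
            changed += 1
    return changed


def settings(root):
    """Flatten a workflow into {(node name, entry key): value}."""
    return {
        (node.get("name"), entry.get("key")): entry.get("value")
        for node in root.findall("node")
        for entry in node.findall("entry")
    }


def config_diff(root_a, root_b):
    """Report settings whose values differ between two workflow instances."""
    a, b = settings(root_a), settings(root_b)
    return {
        k: (a.get(k), b.get(k))
        for k in sorted(a.keys() | b.keys())
        if a.get(k) != b.get(k)
    }


if __name__ == "__main__":
    original = ET.fromstring(WORKFLOW_XML.strip())
    moved = ET.fromstring(WORKFLOW_XML.strip())
    rewrite_values(moved, "data_path", "/old/data", "/new/data")
    print(config_diff(original, moved))
```

The step that checks the updated values (for example, testing that each new path exists or that a login works) is left out here because it depends on the target site.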


Better workflow management

An open repository of workflow nodes, shared by all KNIME user groups (two parts – mature and beta)

Saving of graphing parameters, so that an entire workflow can be automated

Adding a workflow start node with iteration over directories

Data flow efficiency: data exchange between nodes via hierarchical structures (XML?) and tables (for Perl?)


Image handling

Image type improvements (this type is under development and may not be mature yet)

Image storage in openBIS (various levels of resolution, by well, plate, etc), with associated indexes, so that stats at various levels can be generated easily


openBIS/B-Fabric connectivity

Access to raw data from KNIME

Image indexing, so that KNIME can effectively query features

Analysis results storage

Dumping of workflow run parameters/outcomes to DB (maybe picking up a workflow from DB)


SQL handling

Better table merging (to merge data from several tables, supported by a query definition), as this is currently cumbersome
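
As a rough illustration of what merging several tables through a single query definition could look like, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (`plates`, `wells`, `features`), the sample rows and the function name are assumptions made for this example only; they do not describe any of the databases mentioned in the talk.

```python
import sqlite3

# Hypothetical schema and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE plates (plate_id TEXT PRIMARY KEY, experiment TEXT);
CREATE TABLE wells (well_id INTEGER PRIMARY KEY, plate_id TEXT, position TEXT);
CREATE TABLE features (well_id INTEGER, cell_count INTEGER);
INSERT INTO plates VALUES ('P1', 'screen_1');
INSERT INTO wells VALUES (1, 'P1', 'A01'), (2, 'P1', 'A02');
INSERT INTO features VALUES (1, 120), (2, 95);
""")

# The query definition: one statement that merges all three tables.
MERGE_QUERY = """
SELECT p.experiment, w.position, f.cell_count
FROM plates AS p
JOIN wells AS w ON w.plate_id = p.plate_id
JOIN features AS f ON f.well_id = w.well_id
ORDER BY w.position
"""


def merged_rows(connection, query=MERGE_QUERY):
    """Run the merge query and return the joined rows as a list of tuples."""
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    for row in merged_rows(conn):
        print(row)
```

Holding the join in one named query, rather than chaining pairwise merges, means the merge can be stored and rerun together with the rest of the workflow.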


Summary

KNIME is used in Zurich and Lausanne, but does not provide end-to-end processing

A list of new requirements was gathered from workflow users

An outline grant was submitted to KTI

Your input is needed!