Taverna and my Grid A solution for confusion intensive computing? Tom Oinn – EMBL-EBI,...

Taverna and myGrid
A solution for confusion intensive computing?
Tom Oinn – EMBL-EBI
http://mygrid.org.uk http://taverna.sf.net


http://www.ebi.ac.uk/Information/Site_Info/picture_book.html



Who are we?
myGrid
  An EPSRC funded ‘eScience Pilot Project’
  Based across multiple sites in the UK
Taverna
  A tethered spin-off of the myGrid project
  Aimed at producing powerful tools to complement the basic research work
EBI Hinxton Campus

What is Taverna?
Allows scientists to graphically construct complex processes in the form of workflows
What is a workflow?
  Set of activities that make up a process
  Definitions about how data moves between these activities
  The user specifies what to do but not how to do it
Insulates users from the complexity of distributed computing
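The workflow definition above (activities plus definitions of how data moves between them) can be illustrated with a minimal Python sketch. This is not Taverna's workflow language or enactor; `run_workflow`, the activity names and the accession value are all hypothetical. The user states only the activities and the data links, and the run order is derived from the links.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(activities, links, inputs):
    """Run each activity once all of its upstream data is available.

    activities: name -> function taking one keyword argument per input port
    links: list of (source, sink_activity, sink_port); a source is either a
           workflow input name or the name of another activity
    inputs: workflow input name -> value
    """
    # The dependency graph comes from the data links alone.
    graph = {name: set() for name in activities}
    for source, sink, _port in links:
        if source in activities:
            graph[sink].add(source)

    values = dict(inputs)  # keyed by the input or activity that produced them
    for name in TopologicalSorter(graph).static_order():
        kwargs = {port: values[src] for src, sink, port in links if sink == name}
        values[name] = activities[name](**kwargs)
    return values


# Hypothetical activities standing in for remote services.
activities = {
    "fetch": lambda accession: "aacg" * 3,  # pretend database lookup
    "revcomp": lambda seq: seq[::-1].translate(str.maketrans("acgt", "tgca")),
    "length": lambda seq: len(seq),
}
links = [
    ("accession", "fetch", "accession"),
    ("fetch", "revcomp", "seq"),
    ("fetch", "length", "seq"),
]
results = run_workflow(activities, links, {"accession": "hypothetical-id"})
```

Nothing in `links` says when `revcomp` or `length` should run; the ordering falls out of the data dependencies, which is the "what, not how" point made above.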

Looks a bit like this…

myGrid, Taverna and WBS
One of several early adopters of Taverna
Manchester based group working on Williams-Beuren Syndrome in the medical genetics department
Workflows written by life scientists not computer scientists
Following slides stolen at the last minute from Hannah Tipney at Manchester!

Williams-Beuren Syndrome (WBS)
Contiguous sporadic gene deletion disorder
1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
Haploinsufficiency of the region results in the phenotype
Multisystem phenotype – muscular, nervous, circulatory systems
Characteristic facial features
Unique cognitive profile
Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
Outgoing personality, friendly nature, ‘charming’

[Figure: Physical Map. Chr 7 (~155 Mb), band 7q11.23 (~1.5 Mb).
Block labels: C-cen, A-cen, B-cen, C-mid, B-mid, A-mid, B-tel, A-tel, C-tel (Block A, Block B, Block C).
Gene labels: FKBP6, FZD9, BAZ1B, BCL7B, TBL2, WBSCR14, WBSCR18, WBSCR21, WBSCR22, STX1A, CLDN3, CLDN4, ELN, LIMK1, WBSCR1/E1f4H, WBSCR5/LAB, RFC2, CYLN2, GTF2IRD1, GTF2I, NCF1, GTF2IRD2, POM121, NOLR1, STAG3, PMS2L, FKBP6T, GTF2IP, NCF1P, GTF2IRD2P.
Clone labels: CTA-315H11, CTB-51J22. Gap.]

Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164

Williams-Beuren Syndrome Microdeletion

[Figure: analysis pipeline diagrams, labels recovered from the slides]
Input: GenBank Accession No -> GenBank Entry -> Seqret -> Nucleotide seq (Fasta)
Nucleotide sequence:
  GenScan: coding sequence, ORFs
  prettyseq: translation/sequence file, good for records and publications
  restrict: restriction enzyme map
  cpgreport: CpG island locations and %
  RepeatMasker: repetitive elements
  ncbiBlastWrapper: Blastn vs nr, est databases
  sixpack: 6 ORFs
  transeq: amino acid translation
Protein sequence:
  epestfind: identifies PEST seq
  pepcoil: predicts coiled-coil regions
  pepstats: MW, length, charge, pI, etc
  pscan: identifies FingerPRINTS
  SignalP, TargetP, PSORTII: predicts cellular location
  InterPro: identifies functional and structural domains/motifs
  Pepwindow? Octanol?: hydrophobic regions
  BlastWrapper: tblastn vs nr, est, est_mouse, est_human databases; Blastp vs nr
  URL inc GB identifier
Overlapping sequence:
  Query nucleotide sequence, RepeatMasker, BLASTwrapper, sort for appropriate sequences only
Regulatory elements:
  RepeatMasker, TF binding prediction, promoter prediction, regulation element prediction
  Identify regulatory elements in genomic sequence

Experiment

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
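A pipeline that starts from a GenBank entry first has to turn coordinate-prefixed text like the excerpt above into a plain sequence, the step the slides assign to Seqret. The sketch below shows one rough way to do this by hand in Python. It is not EMBOSS seqret, and the function name and FASTA header are made up for illustration.

```python
import textwrap


def genbank_lines_to_fasta(text, header, width=60):
    """Strip GenBank position numbers and spacing, then wrap the bases as FASTA."""
    bases = []
    for token in text.split():
        if token.isdigit():
            continue  # position column, e.g. 12181
        bases.append(token.lower())
    sequence = "".join(bases)
    if not sequence.isalpha():
        raise ValueError("unexpected characters in sequence text")
    # textwrap breaks the unspaced sequence into fixed-width lines.
    return ">" + header + "\n" + "\n".join(textwrap.wrap(sequence, width))


# First two lines of the excerpt shown above; the header is hypothetical.
excerpt = """
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
"""
fasta = genbank_lines_to_fasta(excerpt, "excerpt_12181_12300")
```

Doing this for every tool in the pipeline is the 'cut and paste' burden that the workflow approach is meant to remove.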

Analysis via ‘Cut and Paste’

Workflows
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence

The Biological Results

[Figure: map labelled CTA-315H11, CTB-51J22, ELN, WBSCR14, with clones RP11-622P13, RP11-148M21, RP11-731K22]
314,004bp extension
All nine known genes identified (40/45 exons identified): CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28
Four workflow cycles totalling ~10 hours
The gap was correctly closed and all known features identified

And Now… Pretty Pictures

The first thing users see…

BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue)

Different service types, unified.

Launching a workflow…

Invocation progress…

Browsing the results…

Results in context…

Integration Epochs
1. Databases / Data warehouses: Integration of data
2. Distributed Queries, Workflows: Integration of process
3. Semantic Unification: Integration of knowledge
Current state of the art somewhere around 2.5, what do we need to do next?

Last Year’s Problems
Multiple data sources
  SOA approaches, distributed queries e.g. OGSA-DAI
Heterogeneous computational resources
  SOA combined with workflow methods
  Toolkits widely used and deployed e.g. Soaplab, BioMoby et al.
As a community we can provide data and compute services, and are doing so.

Yesterday’s Problems
Usability
  Distributed computing and biologists go together like water and mains electricity
  Graphical workflow environments now exist e.g. Taverna, Triana, Discovery-Net, Ptolemy…
  Can be improved upon but basically usable by the target audience of expert researchers.
Concept
  Workflows, SOA and friends are now accepted as a legitimate way of doing things
  Methods have moved from the ‘out there’ research world to just inside the common scientific toolbox
Functionality
  Integration of BioMoby, EMBOSS, SOAP services, command line tools, SeqHound, Web CGIs and others on demand
  Fault tolerance and reporting
  Enactment of complex process flows
  Some service discovery (crude but surprisingly effective)
  Available and widely used (>2500 downloads of Taverna from http://taverna.sf.net)
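As a rough illustration of "fault tolerance and reporting", the Python sketch below retries a failing service a few times, then falls back to an alternate, and keeps a report of every failed attempt. It is an assumption about what such a mechanism could look like, not Taverna's implementation. `invoke_with_fallback`, `flaky` and `mirror` are hypothetical names.

```python
import time


def invoke_with_fallback(services, payload, retries=2, delay=0.0):
    """Try each service in order, retrying before moving to the next one.

    Returns (result, report), where report lists every failed attempt as
    (service name, attempt number, error message).
    """
    report = []
    for name, call in services:
        for attempt in range(1, retries + 2):  # first try plus `retries` retries
            try:
                return call(payload), report
            except Exception as exc:  # a sketch: real code would catch less
                report.append((name, attempt, str(exc)))
                time.sleep(delay)
    raise RuntimeError(f"all services failed: {report}")


def flaky(_payload):
    """Stand-in for a remote service that is currently down."""
    raise ConnectionError("service unavailable")


def mirror(payload):
    """Stand-in for an equivalent service hosted elsewhere."""
    return payload.upper()


result, report = invoke_with_fallback(
    [("primary", flaky), ("mirror", mirror)], "acgt"
)
```

The report matters as much as the result: with distributed services the user needs to know which provider actually produced the data.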

Current Work
Service Discovery
  Doing it properly – semantic registry technology
  Ontologies for services, data etc.
  Annotating the corpus of services with metadata
Data management
  Putting data in context within the scientific process
  Managing the new bursts of data from workflow systems

So Where’s This Confusion Then?

At the moment, invoking a workflow gives results equivalent to a big set of files
Files are data, what we want is knowledge
Confusion is formed from data and banished by the conversion of that data into knowledge
This is the problem for Today, Tomorrow and beyond!
So, what are we going to do about it next?

Some Types of knowledge in myGrid and Taverna
Data to Context Knowledge
  Which operation produced the data?
  Which workflow defined the operation?
  When, Where and Who?
  Workflow design and enactment!
Data to Data Knowledge
  Relate operation inputs and outputs
  Base ‘derived from’ relation in RDF
  Can be specialized through templates
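The two kinds of knowledge above can be pictured as statements attached to each data item. The Python sketch below stores them as plain subject, predicate, object tuples rather than real RDF. The `ex:` predicate names, the `record` helper and the example values are invented for illustration and are not myGrid's vocabulary.

```python
from datetime import datetime, timezone
from itertools import count

_ids = count(1)
triples = []  # (subject, predicate, object) statements


def record(value, operation, workflow, inputs=(), who="unknown"):
    """Register a data item with its context and 'derived from' statements."""
    item = f"data:{next(_ids)}"
    triples.append((item, "ex:value", value))
    # Data to Context: which operation, which workflow, who and when.
    triples.append((item, "ex:producedBy", operation))
    triples.append((operation, "ex:definedIn", workflow))
    triples.append((item, "ex:createdBy", who))
    triples.append((item, "ex:createdAt", datetime.now(timezone.utc).isoformat()))
    # Data to Data: relate this output to the inputs it came from.
    for parent in inputs:
        triples.append((item, "ex:derivedFrom", parent))
    return item


def ancestors(item):
    """Follow ex:derivedFrom links back to the original inputs."""
    found = []
    stack = [item]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "ex:derivedFrom" and o not in found:
                found.append(o)
                stack.append(o)
    return found


# Hypothetical three-step run: fetch a sequence, translate it, search with it.
dna = record("atggcc", "op:fetch", "wf:example", who="hypothetical-user")
protein = record("MA", "op:translate", "wf:example", inputs=[dna], who="hypothetical-user")
hits = record(["hit1"], "op:search", "wf:example", inputs=[protein], who="hypothetical-user")
lineage = ancestors(hits)
```

A specialized relation (for example "translated from") could be layered on top of the base `ex:derivedFrom` link, which is roughly what the templates bullet suggests.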

Context to Context Knowledge
  Common information model shared across components
  Encapsulates organizations, people, experiment designs, instances and results.
  Equivalent to an overall eScience file system
In Silico eScience ‘Materials and Methods’
  Expressed in terms of workflow definitions within Taverna

The eScience Knowledge Gap (one of them anyway!)

Hypothesis is missing!
  Without some specification of the hypothesis which the experiment is designed to test we cannot do much more than the forms of knowledge stated previously.
Hypothesis as part of the Process Model?
  Can we define the hypothesis as the population of a domain and experiment specific data model in combination with a set of statements about instances of this model?
  How would this fit in with the current workflow centric approach we’re taking?

But Domain Modeling is Hard
Do we need to model the entire domain?
Derive an experiment specific model by either creating from scratch or aggregating fine grained ‘Atomic Domain Models’
  Examples – Sequence + Features, GO Term Graph, Metabolic Pathway, Protein Interaction Set
For example, if the hypothesis is ‘proteins annotated with GO term xxx or children by InterPro scan are implicated in pathway zzz’
  Aggregate target domain model consists of the combination of these Atomic Domain Models.
  Hypothesis statement in the form of this model + query over the model topology which returns the proportion of proteins in the model satisfying the hypothesis constraint.
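The example hypothesis above can be made concrete with a small Python sketch. The aggregate model is a dictionary combining three toy stand-ins for atomic domain models (a GO term graph, protein annotations, pathway membership), and the hypothesis is a query returning a proportion. All identifiers and data are placeholders. The query takes the denominator to be the proteins annotated with the term or its children, which is one possible reading of the slide.

```python
def descendants(go_children, term):
    """Return the term plus every term below it in the GO graph."""
    seen = {term}
    stack = [term]
    while stack:
        for child in go_children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def hypothesis_support(model, go_term, pathway):
    """Proportion of proteins annotated with go_term (or a child) found in pathway."""
    terms = descendants(model["go_children"], go_term)
    annotated = {p for p, ts in model["annotations"].items() if terms & set(ts)}
    if not annotated:
        return None  # nothing in the model to test the hypothesis against
    in_pathway = annotated & set(model["pathways"].get(pathway, ()))
    return len(in_pathway) / len(annotated)


# Aggregate target domain model built from three toy atomic models.
model = {
    "go_children": {"GO:xxx": ["GO:xxx.1"], "GO:xxx.1": []},
    "annotations": {
        "P1": ["GO:xxx"],
        "P2": ["GO:xxx.1"],
        "P3": ["GO:other"],
        "P4": ["GO:xxx.1"],
    },
    "pathways": {"zzz": ["P1", "P2", "P3"]},
}
support = hypothesis_support(model, "GO:xxx", "zzz")
```

Here three proteins carry the term or its child and two of them sit in the pathway, so the query reports two thirds.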

Populating the Target Domain Model

Workflows are based on the composition of distributed services
Can we derive services from the Target Domain Model?
  For example, the Sequence + Features model would manifest a setFeature(start, end, sequence, feature) operation or similar.
  Allow the user to incorporate these operations into the workflow alongside the regular services, effectively annotating the workflow.
  Make use of existing Data to Data Knowledge and Data to Context Knowledge to link entities within the Target Domain Model with derivation information.
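A minimal Python sketch of the idea above: a toy Sequence + Features model exposing a `set_feature` operation that mirrors the setFeature(start, end, sequence, feature) signature from the slide, called from a workflow step next to a regular service. The class, the extra `derived_from` argument, the 0-based coordinates and the `find_start_codons` service are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceFeatures:
    """Hypothetical 'Sequence + Features' atomic domain model."""

    sequences: dict = field(default_factory=dict)  # sequence id -> residues
    features: list = field(default_factory=list)   # annotation records

    def set_feature(self, start, end, sequence, feature, derived_from=None):
        """Attach a feature to residues [start, end) of a stored sequence."""
        residues = self.sequences[sequence]
        if not 0 <= start < end <= len(residues):
            raise ValueError("feature lies outside the sequence")
        record = {
            "sequence": sequence,
            "start": start,
            "end": end,
            "feature": feature,
            "derived_from": derived_from,  # link back to the producing step
        }
        self.features.append(record)
        return record


model = SequenceFeatures(sequences={"seq1": "atggccattgtaatgggccgc"})


def find_start_codons(seq_id):
    """Stand-in for a regular analysis service in the workflow."""
    residues = model.sequences[seq_id]
    return [i for i in range(len(residues) - 2) if residues[i:i + 3] == "atg"]


# The model operation sits downstream of the regular service and populates
# the domain model instead of leaving a bare list of positions.
for pos in find_start_codons("seq1"):
    model.set_feature(pos, pos + 3, "seq1", "start_codon",
                      derived_from="find_start_codons")
```

After the run, `model.features` holds structured annotations with derivation information rather than a loose file of numbers.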

Data Transformed to Knowledge
A workflow invocation would now result in a populated domain model as opposed to (or in addition to) a large set of discrete pieces of data.
Explicit semantics in the Target Domain Model
  Drive hypothesis testing
  Drive visualization in a graphical UI
  Generate textual summary of the knowledge

myGrid and WBS People!
Core
Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pockock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
Robin McEntire (GSK)
Collaborators
Keith Decker

Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net
