Taverna and myGrid: A solution for confusion intensive computing?
Tom Oinn – EMBL-EBI, [email protected]
http://mygrid.org.uk http://taverna.sf.net
Who are we?
myGrid
  An EPSRC funded ‘eScience Pilot Project’
  Based across multiple sites in the UK
Taverna
  A tethered spin-off of the myGrid project
  Aimed at producing powerful tools to complement the basic research work
EBI Hinxton Campus
What is Taverna?
  Allows scientists to graphically construct complex processes in the form of workflows
What is a workflow?
  Set of activities that make up a process
  Definitions about how data moves between these activities
  The user specifies what to do but not how to do it
  Insulates users from the complexity of distributed computing
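As a rough illustration of the definition above (this is not Taverna's workflow language or API), a workflow can be sketched as a set of named activities plus links describing how data moves between them. Every name here (`Workflow`, `add_activity`, `link`, `run`) is hypothetical. The user only declares what is connected to what; the small engine works out the execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Workflow:
    """A toy dataflow workflow: activities plus links between them."""

    def __init__(self):
        self.activities = {}   # activity name -> function taking keyword inputs
        self.links = {}        # (target activity, input name) -> source name

    def add_activity(self, name, func):
        self.activities[name] = func

    def link(self, source, target, port):
        # Declare that the output of `source` feeds input `port` of `target`.
        self.links[(target, port)] = source

    def run(self, **workflow_inputs):
        # Derive the dependency graph from the links; the user never states order.
        deps = {name: set() for name in self.activities}
        for (target, _port), source in self.links.items():
            if source in self.activities:
                deps[target].add(source)
        results = dict(workflow_inputs)
        for name in TopologicalSorter(deps).static_order():
            kwargs = {port: results[src]
                      for (target, port), src in self.links.items()
                      if target == name}
            results[name] = self.activities[name](**kwargs)
        return results


wf = Workflow()
wf.add_activity("clean", lambda raw: raw.replace(" ", "").upper())
wf.add_activity("length", lambda seq: len(seq))
wf.link("raw_sequence", "clean", "raw")   # workflow input -> clean
wf.link("clean", "length", "seq")         # clean -> length

out = wf.run(raw_sequence="acat ttct ac")
print(out["clean"], out["length"])
```

The two `link` calls could be written in either order; the result is the same because ordering comes from the data dependencies, not from the order of the declarations.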
Looks a bit like this…
myGrid, Taverna and WBS
  One of several early adopters of Taverna
  Manchester based group working on Williams-Beuren Syndrome in the medical genetics department
  Workflows written by life scientists not computer scientists
  Following slides stolen at the last minute from Hannah Tipney at Manchester!
Williams-Beuren Syndrome (WBS)
  Contiguous sporadic gene deletion disorder
  1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosis
  Haploinsufficiency of the region results in the phenotype
  Multisystem phenotype – muscular, nervous, circulatory systems
  Characteristic facial features
  Unique cognitive profile
  Mental retardation (IQ 40-100, mean ~60, ‘normal’ mean ~100)
  Outgoing personality, friendly nature, ‘charming’
[Figure: Physical Map. Chr 7 (~155 Mb), band 7q11.23 (~1.5 Mb).
Duplicated blocks A, B and C, with copies labelled cen, mid and tel (C-cen, A-cen, B-cen, C-mid, B-mid, A-mid, B-tel, A-tel, C-tel).
Genes labelled: GTF2I, RFC2, CYLN2, GTF2IRD1, NCF1, WBSCR1/E1f4H, LIMK1, ELN, CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, TBL2, BCL7B, BAZ1B, FZD9, WBSCR5/LAB, WBSCR22, FKBP6, POM121, NOLR1, GTF2IRD2, WBSCR14, STAG3, PMS2L.
Also labelled: FKBP6T, GTF2IP, NCF1P, GTF2IRD2P.
Clones CTA-315H11 and CTB-51J22; Gap.]
Eichler E, Clark R & She X. An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Reviews Genetics (2004) 5:345-354
Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:157-164
Williams-Beuren Syndrome Microdeletion
[Figure: analysis pipeline diagram; labels as extracted]
GenBank Accession No → GenBank Entry → Seqret → Nucleotide seq (Fasta)
Nucleotide sequence services and outputs:
  GenScan: Coding sequence
  sixpack: 6 ORFs
  prettyseq: Translation/sequence file. Good for records and publications
  restrict: Restriction enzyme map
  cpgreport: CpG Island locations and %
  RepeatMasker: Repetitive elements
  ncbiBlastWrapper: Blastn Vs nr, est databases.
  transeq: Amino Acid translation
Other label: ORFs
Protein sequence services and outputs:
  epestfind: Identifies PEST seq
  pepcoil: Predicts Coiled-coil regions
  pepstats: MW, length, charge, pI, etc
  pscan: Identifies FingerPRINTS
  SignalP, TargetP, PSORTII: Predicts cellular location
  InterPro: Identifies functional and structural domains/motifs
  Pepwindow? Octanol?: Hydrophobic regions
  BlastWrapper: tblastn Vs nr, est, est_mouse, est_human databases. Blastp Vs nr
  URL inc GB identifier
Query nucleotide sequence
  RepeatMasker
  BLASTwrapper
  Sort for appropriate Sequences only
Identify regulatory elements in genomic sequence
  RepeatMasker
  TF binding Prediction
  Promoter Prediction
  Regulation Element Prediction
Experiment
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Analysis via ‘Cut and Paste’
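Done by hand, the first step is stripping the position numbers and spacing from GenBank-formatted sequence rows before pasting them into the next tool. A minimal Python sketch of that cleanup, assuming the input is a block of GenBank sequence rows; the function names and the FASTA header `example_fragment` are made up for illustration, and this is not how Seqret itself is implemented.

```python
import re
import textwrap


def genbank_rows_to_sequence(text):
    """Drop position numbers and whitespace from GenBank sequence rows."""
    return re.sub(r"[\d\s]+", "", text)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    return ">" + name + "\n" + "\n".join(textwrap.wrap(sequence, width)) + "\n"


rows = """
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
"""
seq = genbank_rows_to_sequence(rows)
print(to_fasta("example_fragment", seq))
```

Repeating this for every tool in the analysis is where the manual approach gets slow and error-prone.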
A: Identification of overlapping sequence
B: Characterisation of nucleotide sequence
C: Characterisation of protein sequence
Workflows
The Biological Results
[Figure: clones CTA-315H11 and CTB-51J22, with RP11-622P13, RP11-148M21 and RP11-731K22; markers ELN and WBSCR14]
314,004 bp extension
All nine known genes identified (40/45 exons identified): CLDN4, CLDN3, STX1A, WBSCR18, WBSCR21, WBSCR22, WBSCR24, WBSCR27, WBSCR28
Four workflow cycles totalling ~10 hours
The gap was correctly closed and all known features identified
And Now… Pretty Pictures
The first thing users see…
BioMoby (orange), Soaplab (wheat), Workflow (red), SOAP Service (green), SeqHound (blue), Local Java operation (purple), String constant (pale blue)
Different service types, unified.
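One way to read "unified" is that the workflow engine sees every service type through a single interface, whatever sits behind it. The sketch below is an assumption about that general pattern, not Taverna's plugin API; `Service`, `LocalOperation`, `StringConstant` and `run_chain` are hypothetical names, and only in-process service types are shown.

```python
from typing import Callable, Dict, Protocol


class Service(Protocol):
    """What the engine needs from any service, whatever its type."""

    def invoke(self, inputs: Dict[str, str]) -> Dict[str, str]: ...


class LocalOperation:
    """Wraps an in-process function (stand-in for a local operation)."""

    def __init__(self, func: Callable[[str], str]):
        self.func = func

    def invoke(self, inputs):
        return {"output": self.func(inputs["input"])}


class StringConstant:
    """A service that ignores its inputs and always returns a fixed value."""

    def __init__(self, value: str):
        self.value = value

    def invoke(self, inputs):
        return {"output": self.value}


def run_chain(first: Service, rest: list) -> str:
    """Feed the output of each service into the next one."""
    value = first.invoke({})["output"]
    for service in rest:
        value = service.invoke({"input": value})["output"]
    return value


result = run_chain(StringConstant("acgt"),
                   [LocalOperation(str.upper), LocalOperation(lambda s: s[::-1])])
print(result)
```

A remote service type would implement the same `invoke` method with a network call inside it; `run_chain` would not need to change.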
Launching a workflow…
Invocation progress…
Browsing the results…
Results in context…
Integration Epochs
1. Databases / Data warehouses
  Integration of data
2. Distributed Queries, Workflows
  Integration of process
3. Semantic Unification
  Integration of knowledge
Current state of the art somewhere around 2.5, what do we need to do next?
Last Year’s Problems
Multiple data sources
  SOA approaches, distributed queries e.g. OGSA-DAI
Heterogeneous computational resources
  SOA combined with workflow methods
  Toolkits widely used and deployed e.g. Soaplab, BioMoby et al.
As a community we can provide data and compute services, and are doing so.
Yesterday’s Problems
Usability
  Distributed computing and biologists go together like water and mains electricity
  Graphical workflow environments now exist e.g. Taverna, Triana, Discovery-Net, Ptolemy…
  Can be improved upon but basically usable by the target audience of expert researchers.
Concept
  Workflows, SOA and friends are now accepted as a legitimate way of doing things
  Methods have moved from the ‘out there’ research world to just inside the common scientific toolbox
Functionality
  Integration of BioMoby, EMBOSS, SOAP services, command line tools, SeqHound, Web CGIs and others on demand
  Fault tolerance and reporting
  Enactment of complex process flows
  Some service discovery (crude but surprisingly effective)
  Available and widely used (>2500 downloads of Taverna from http://taverna.sf.net)
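The slide does not say how fault tolerance and reporting are implemented. One common pattern, sketched here purely as an assumption, is to retry a failing service a few times, fall back to an alternate, and keep a report of every attempt. The function and service names are hypothetical.

```python
import time


def invoke_with_retry(services, payload, retries=2, delay=0.0):
    """Try each service in turn, retrying failures, and report what happened."""
    report = []
    for service in services:
        for attempt in range(1, retries + 2):
            try:
                result = service(payload)
                report.append((service.__name__, attempt, "ok"))
                return result, report
            except Exception as exc:  # record the fault instead of hiding it
                report.append((service.__name__, attempt, f"failed: {exc}"))
                time.sleep(delay)
    raise RuntimeError(f"all services failed: {report}")


def flaky_primary(seq):
    # Stand-in for a remote service that is currently down.
    raise ConnectionError("service unavailable")


def mirror(seq):
    # Stand-in for an alternate service offering the same operation.
    return seq.upper()


result, report = invoke_with_retry([flaky_primary, mirror], "acgt", retries=1)
print(result)
for entry in report:
    print(entry)
```

The report matters as much as the retry: when a long-running workflow does fail, the user needs to see which service failed and on which attempt.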
Current Work
Service Discovery
  Doing it properly – semantic registry technology
  Ontologies for services, data etc.
  Annotating the corpus of services with metadata
Data management
  Putting data in context within the scientific process
  Managing the new bursts of data from workflow systems
So Where’s This Confusion Then?
At the moment, invoking a workflow gives results equivalent to a big set of files
  Files are data, what we want is knowledge
  Confusion is formed from data and banished by the conversion of that data into knowledge
  This is the problem for Today, Tomorrow and beyond!
So, what are we going to do about it next?
Some Types of knowledge in myGrid and Taverna
Data to Context Knowledge
  Which operation produced the data?
  Which workflow defined the operation?
  When, Where and Who?
  Workflow design and enactment!
Data to Data Knowledge
  Relate operation inputs and outputs
  Base ‘derived from’ relation in RDF
  Can be specialized through templates
Context to Context Knowledge
  Common information model shared across components
  Encapsulates organizations, people, experiment designs, instances and results.
  Equivalent to an overall eScience file system
In Silico eScience ‘Materials and Methods’
  Expressed in terms of workflow definitions within Taverna
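A small sketch of the first two kinds of knowledge, under stated assumptions: plain Python tuples stand in for RDF triples, and the predicate names (`producedBy`, `definedIn`, `derivedFrom`), item ids and timestamps are invented for illustration rather than taken from any myGrid vocabulary. Each data item carries its context, and `derivedFrom` links relate operation outputs back to their inputs.

```python
from dataclasses import dataclass, field


@dataclass
class DataItem:
    """A workflow data item with the context it was produced in."""
    item_id: str
    operation: str      # which operation produced the data
    workflow: str       # which workflow defined the operation
    when: str           # placeholder timestamp, example data only
    who: str
    derived_from: list = field(default_factory=list)  # ids of input items


def to_triples(items):
    """Flatten items into (subject, predicate, object) statements."""
    triples = []
    for item in items:
        triples.append((item.item_id, "producedBy", item.operation))
        triples.append((item.operation, "definedIn", item.workflow))
        for source in item.derived_from:
            triples.append((item.item_id, "derivedFrom", source))
    return triples


def ancestors(item_id, triples):
    """Follow 'derivedFrom' links back to every upstream data item."""
    found = set()
    stack = [item_id]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == "derivedFrom" and o not in found:
                found.add(o)
                stack.append(o)
    return found


items = [
    DataItem("seq1", "fetch", "wf_example", "2004-01-01", "user_a"),
    DataItem("masked1", "mask_repeats", "wf_example", "2004-01-01", "user_a", ["seq1"]),
    DataItem("hits1", "blast", "wf_example", "2004-01-01", "user_a", ["masked1"]),
]
triples = to_triples(items)
print(sorted(ancestors("hits1", triples)))
```

Because the workflow engine already knows which operation consumed and produced what, statements like these can be recorded during enactment without extra effort from the user.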
The eScience Knowledge Gap (one of them anyway!)
Hypothesis is missing!
  Without some specification of the hypothesis which the experiment is designed to test we cannot do much more than the forms of knowledge stated previously.
Hypothesis as part of the Process Model?
  Can we define the hypothesis as the population of a domain and experiment specific data model in combination with a set of statements about instances of this model?
  How would this fit in with the current workflow centric approach we’re taking?
But Domain Modeling is Hard
Do we need to model the entire domain?
  Derive an experiment specific model by either creating from scratch or aggregating fine grained ‘Atomic Domain Models’
  Examples – Sequence + Features, GO Term Graph, Metabolic Pathway, Protein Interaction Set
For example, if the hypothesis is ‘proteins annotated with GO term xxx or children by InterPro scan are implicated in pathway zzz’
  Aggregate target domain model consists of the combination of these Atomic Domain Models.
  Hypothesis statement in the form of this model + query over the model topology which returns the proportion of proteins in the model satisfying the hypothesis constraint.
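To make the example hypothesis concrete, here is a toy sketch in Python. Three small dictionaries stand in for the atomic domain models (GO term graph, protein annotations, pathway membership), and the hypothesis is a query over them. All identifiers and data are made up, and reading "proportion" as the fraction of annotated proteins that are also pathway members is one possible interpretation of the slide, not a definition from it.

```python
# Atomic model 1: GO term graph (child -> parent); identifiers are made up.
go_parent = {"GO:child1": "GO:xxx", "GO:child2": "GO:child1", "GO:other": None}

# Atomic model 2: protein annotations (e.g. as produced by an InterPro scan).
annotations = {
    "P1": {"GO:xxx"},
    "P2": {"GO:child2"},
    "P3": {"GO:other"},
    "P4": {"GO:child1"},
}

# Atomic model 3: pathway membership.
pathway_members = {"zzz": {"P1", "P2", "P3"}}


def is_term_or_descendant(term, root):
    """True if `term` is `root` or lies below it in the GO term graph."""
    while term is not None:
        if term == root:
            return True
        term = go_parent.get(term)
    return False


def hypothesis_support(root_term, pathway):
    """Fraction of proteins annotated with root_term (or children) found in the pathway."""
    annotated = {p for p, terms in annotations.items()
                 if any(is_term_or_descendant(t, root_term) for t in terms)}
    if not annotated:
        return 0.0
    return len(annotated & pathway_members[pathway]) / len(annotated)


print(hypothesis_support("GO:xxx", "zzz"))
```

The query only walks the structure of the combined model (term graph, annotations, membership), which is what lets the same hypothesis be re-evaluated whenever a workflow repopulates the model.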
Populating the Target Domain Model
Workflows are based on the composition of distributed services
Can we derive services from the Target Domain Model?
  For example, the Sequence + Features model would manifest a setFeature(start, end, sequence, feature) operation or similar.
  Allow the user to incorporate these operations into the workflow alongside the regular services, effectively annotating the workflow.
  Make use of existing Data to Data Knowledge and Data to Context Knowledge to link entities within the Target Domain Model with derivation information.
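A minimal sketch of what such a model-derived operation might look like, assuming a toy Sequence + Features model. The class names, the `derived_by` field and the example sequence are hypothetical. The slide's setFeature(start, end, sequence, feature) takes the sequence as an argument; here it is a method on the model object instead, which is a design choice of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    start: int
    end: int
    label: str
    derived_by: str  # which workflow step asserted this feature


@dataclass
class SequenceWithFeatures:
    """A tiny 'Sequence + Features' atomic domain model."""
    sequence: str
    features: list = field(default_factory=list)

    def set_feature(self, start, end, label, derived_by="unknown"):
        # The operation a workflow step would call to populate the model.
        if not (0 <= start < end <= len(self.sequence)):
            raise ValueError("feature coordinates outside the sequence")
        self.features.append(Feature(start, end, label, derived_by))

    def summary(self):
        # Textual summary of the populated model, one line per feature.
        return [f"{f.label} at {f.start}-{f.end} "
                f"({self.sequence[f.start:f.end]}), from {f.derived_by}"
                for f in self.features]


model = SequenceWithFeatures("acgtcgcgcgttaa")
model.set_feature(4, 10, "CpG-rich region", derived_by="cpgreport step")
for line in model.summary():
    print(line)
```

Placed in a workflow after an analysis service, a call like `set_feature` turns that service's raw output into a statement in the model, with the derivation information attached.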
Data Transformed to Knowledge
A workflow invocation would now result in a populated domain model as opposed to (or in addition to) a large set of discrete pieces of data.
Explicit semantics in the Target Domain Model
  Drive hypothesis testing
  Drive visualization in a graphical UI
  Generate textual summary of the knowledge
myGrid and WBS People!
Core
Matthew Addis, Nedim Alpdemir, Tim Carver, Rich Cawley, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizauskas, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock, Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.
Users
Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UK
Hannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UK
Postgraduates
Martin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire
Industrial
Dennis Quan, Sean Martin, Michael Niemi, Syd Chapman (IBM)
Robin McEntire (GSK)
Collaborators
Keith Decker
Acknowledgements
myGrid is an EPSRC funded UK eScience Program Pilot Project
Particular thanks to the other members of the Taverna project, http://taverna.sf.net