Taverna: the story from up-above
Antoon Goderis
The University of Manchester, UK
http://www.mygrid.org.uk/taverna
http://www.omii.ac.uk
DART workshop, Brisbane, Australia, 14 December 2006
2
Overview
The situation in -omics
Creating new biology using Taverna
Taverna
  Key traits
  Features on the OMII roadmap
    Including today’s release
3
Bioinformaticians & co.
4
Open environment
Data, Data, Data
EBI
SeqHound
SRS
National Center for Biotechnology Information (USA)
Cambridge, UK
Tokyo, Japan
5
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
6
The situation in {genomics, transcriptomics, proteomics, metabolomics ..}
Lots of data
Lots of parameters to choose
An analysis takes a long time
The analysis services are unreliable
Lots of analysis steps
Need to record and explain your steps
7
Enter workflows
Lots of data [high throughput]
Lots of parameters to choose [best practice]
An analysis takes a long time [long running]
The analysis services are unreliable [fault tolerance]
Lots of analysis steps [data and control flow]
Need to record and explain your steps [provenance]
8
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg
Workflow-based middleware
9
myGrid
http://www.mygrid.org.uk
UK e-Science pilot project since 2001
Part of the Open Middleware Infrastructure Institute UK
Builds middleware for life scientists that enables them to undertake in silico experiments and share those experiments and their results.
For individual scientists, in under-resourced labs, who use other people’s applications.
Open source.
Workflows & Semantic Technologies for metadata management.
Data flows.
Ad hoc & exploratory.
11
?200
Microarray + QTL
Genes captured in microarray experiment and present in QTL region
Phenotypic response investigated using microarray in form of expressed genes or evidence provided through QTL mapping
Genotype Phenotype
[Andy Brass, Steve Kemp, Paul Fisher, 2006]
12
Key:
A – Retrieve genes in QTL region
B – Annotate genes with external database IDs
C – Cross-reference IDs with KEGG gene IDs
D – Retrieve microarray data from MaxD database
E – For each KEGG gene get the pathways it’s involved in
F – For each pathway get a description of what it does
G – For each KEGG gene get a description of what it does
[Andy Brass, Steve Kemp, Paul Fisher, 2006]
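The steps in the key can be read as a data flow from a QTL region to annotated pathways. The sketch below is one way to picture that flow, assuming small in-memory lookup tables in place of the real QTL, annotation, KEGG and MaxD services. Every gene, ID and description is a made-up placeholder, and `run_workflow` is a hypothetical name, not part of Taverna or of the original workflow.

```python
# Hypothetical in-memory stand-ins for the remote services; every value is a placeholder.
QTL_GENES = {"qtl_region_1": ["geneA", "geneB", "geneC"]}                     # step A
EXTERNAL_IDS = {"geneA": "ext:1", "geneB": "ext:2", "geneC": "ext:3"}         # step B
KEGG_GENE_IDS = {"ext:1": "kegg:g1", "ext:2": "kegg:g2"}                      # step C
MICROARRAY_GENES = {"geneA", "geneC"}                                         # step D
GENE_PATHWAYS = {"kegg:g1": ["path:p1", "path:p2"], "kegg:g2": ["path:p2"]}   # step E
PATHWAY_DESCRIPTIONS = {"path:p1": "placeholder pathway 1",
                        "path:p2": "placeholder pathway 2"}                   # step F
GENE_DESCRIPTIONS = {"kegg:g1": "placeholder gene 1",
                     "kegg:g2": "placeholder gene 2"}                         # step G


def run_workflow(region):
    """Chain steps A to G for one QTL region; return one record per mapped gene."""
    records = []
    for gene in QTL_GENES[region]:                       # A
        external_id = EXTERNAL_IDS.get(gene)             # B
        kegg_id = KEGG_GENE_IDS.get(external_id)         # C
        if kegg_id is None:
            continue  # no KEGG cross-reference, so steps E to G cannot run
        pathways = GENE_PATHWAYS.get(kegg_id, [])        # E
        records.append({
            "gene": gene,
            "kegg_id": kegg_id,
            "in_microarray": gene in MICROARRAY_GENES,   # D
            "pathways": {p: PATHWAY_DESCRIPTIONS[p] for p in pathways},  # F
            "description": GENE_DESCRIPTIONS[kegg_id],   # G
        })
    return records


records = run_workflow("qtl_region_1")
```

Genes that appear both in the QTL region and in the microarray data are the ones flagged with `in_microarray`, which mirrors the "genes captured in microarray experiment and present in QTL region" idea.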
13
Result
Captured the pathways returned by the QTL and microarray workflows over the MaxD microarray database
Identified a pathway whose correlating gene (Daxx) is believed to play a role in trypanosomiasis resistance.
Manual analysis of the microarray and QTL data had failed to identify this gene as a candidate.
[Andy Brass, Steve Kemp, Paul Fisher, 2006]
14
Trichuris muris (mouse whipworm) infection
Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.
Manual experimentation: two-year study of candidate genes, processes unidentified
Workflows: the workflow from the trypanosomiasis cattle experiment was reused without change.
Analysis of the data by a biologist found the processes in a couple of days.
[Joanne Pennock, Paul Fisher, 2006]
15
Changing scientific practice
Systematic and comprehensive automation.
Eliminated the user bias and premature filtering of datasets and results that lead to single-sided, expert-driven hypotheses.
Dry people hypothesise, wet people validate.
“Make sense of this data” -> “Does this make sense?”
Workflow factories.
Different dataset, different result.
Accurate provenance.
17
User Uptake
~25000 downloads
Systems biology
Proteomics
Gene/protein annotation
Microarray data analysis
Medical image analysis
Heart simulations
High throughput screening
Phenotypical studies
Plants, Mouse, Human
Astronomy
Dilbert Cartoons
18
[Figure: architecture diagram. Labels: Finding and Sharing Tools; Taverna Workbench; 3rd Party Applications and Portals; Workflow Enactor; Service Management; Results Management; Provenance log; Metadata; Default Data Store; Custom Store; DAS; KAVE; BAKLAVA; Feta; myExperiment; Utopia; Clients; LSIDs]
19
Taverna workbench
20
3000+ services
Open domain services and resources, third party.
Enforce NO common data model.
No common typing, missing metadata.
Soaplab
InstantSoap
21
Services Landscape
22
User Interaction
Allows a workflow to call out to an expert human user
E.g. used to embed the Artemis annotation editor within an otherwise automated genome annotation pipeline
[University of Bergen]
23
Tools, Tools, Tools
Feta Search tool
Pedro Annotation tool
24
Capture and Curation Effort
Ontology and Annotation Curation Team
Franck Tanoh and Katy Wolstencroft
Community Service Providers
Community Scientists
25
Scufl Model
Simple Conceptual Unified Flow Language
[Figure: Taverna Workbench; Shielding & Extensible plug-ins; Workflow Execution; Application; Workflow enactor; processor types: Plain Web Service, Soaplab, Local Java App, WF Enactor, BioMOBY, SeqHound, BioMART, WSRF, Beanshell]
Nested workflows, automatic iterations, best guess data type handling
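Automatic iteration can be pictured as follows: when a processor that expects a single item is handed a list, it is applied to each element in turn. This is only a sketch, assuming a processor is a plain Python function. `invoke_with_implicit_iteration` and `to_upper` are hypothetical names, and none of this is Taverna's enactor code.

```python
def invoke_with_implicit_iteration(processor, value):
    """Apply processor to value; if value is a list, map over it, recursing into nested lists."""
    if isinstance(value, list):
        return [invoke_with_implicit_iteration(processor, item) for item in value]
    return processor(value)


def to_upper(sequence):
    """Hypothetical processor that handles exactly one sequence string."""
    return sequence.upper()


single = invoke_with_implicit_iteration(to_upper, "acat")
nested = invoke_with_implicit_iteration(to_upper, ["acat", ["ttct", "acca"]])
```

The output keeps the shape of the input, so a downstream processor can again receive either one item or a list.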
26
Service incompatibility
Fix up the services to be compatible or…
Shims: libraries of adapters.
Automated data type matching using reasoning over a mismatch and service ontology
Duncan Hull, myGrid
Khalid Belhajjame, ISPIDER
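A shim is a small adapter placed between two services whose formats do not line up. The sketch below assumes an upstream service that returns a FASTA record and a downstream service that expects a bare sequence string. Both services and the record content are hypothetical placeholders, chosen only to illustrate the idea.

```python
def fetch_sequence_as_fasta():
    """Hypothetical upstream service: returns a FASTA-formatted record."""
    return ">placeholder record\nacatttctac\ncaacagtgga\n"


def count_bases(raw_sequence):
    """Hypothetical downstream service: expects a bare sequence with no header."""
    return len(raw_sequence)


def fasta_to_raw_sequence(fasta_text):
    """Shim: drop FASTA header lines and join the remaining sequence lines."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


# Without the shim the header would be counted as sequence; with it the types match.
base_count = count_bases(fasta_to_raw_sequence(fetch_sequence_as_fasta()))
```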
27
Shim identification
Mismatch detection
28
Service failure?
Most services are owned by other people
No control over service failure
Some are research level
Workflows only as good as the services they connect.
Notify failures
Instigate retries
Set criticality
Substitute services
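The four responses to failure listed above (notify, retry, criticality, substitution) can be combined in one wrapper. The following is a minimal sketch under the assumption that a service is a Python callable that raises `ServiceError` on failure. The function and service names are hypothetical and do not reflect Taverna's own configuration options.

```python
class ServiceError(Exception):
    """Raised by a (hypothetical) service call that fails."""


def call_with_fault_tolerance(services, payload, retries=2, critical=True):
    """Try each service in order, retrying each one before substituting the next.

    Returns (result, failures). failures is the notification log of
    (service name, attempt number, message) tuples. If every service fails,
    a critical step raises and a non-critical step returns None as its result.
    """
    failures = []
    for service in services:                    # substitute services
        for attempt in range(1 + retries):      # instigate retries
            try:
                return service(payload), failures
            except ServiceError as error:
                failures.append((service.__name__, attempt + 1, str(error)))  # notify
    if critical:                                # set criticality
        raise ServiceError(f"all services failed: {failures}")
    return None, failures


def primary_service(payload):
    raise ServiceError("primary unavailable")


def mirror_service(payload):
    return payload.upper()


result, failures = call_with_fault_tolerance(
    [primary_service, mirror_service], "acat", retries=1)
```

Here the primary is attempted twice, both failures are logged, and the mirror then supplies the result.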
29
Provenance
Collection
  Observes events from the workflow engine
  Populates an RDF triple store with information from these events
Browse interface
  Simple browser replicates Taverna’s existing result and status browser
  Graphical browser
ProQA Query API
[Figure: provenance graph. Node groups: Data generated by services/workflows (urn:data1, urn:data2, urn:data:3, urn:data12, urn:data:f1, urn:data:f2, urn:hit1…, urn:hit2…., urn:hit5…, urn:hit8…., urn:hit10….., urn:hit50…..), Services (urn:BlastNInvocation3, urn:compareinvocation3, urn:invocation5), Concepts (Blast_report, SwissProt_seq, Sequence_hit, Find similar sequence), literals (New sequence, Missed sequence), DatumCollection, LSDatum. Properties: [input], [output], [instanceOf], [type], [hasHits], [hasName], [contains], [performsTask], [similar_sequence_to], [directlyDerivedFrom], [distantlyDerivedFrom]]
[Zhao et al 07 provenance challenge paper]
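Collecting provenance as triples can be sketched without an RDF library by storing (subject, property, object) tuples. The example below assumes one "invocation finished" event per service call. It borrows property names such as `input`, `output` and `directlyDerivedFrom` from the figure labels, but the way they are attached here is an assumption, and the URNs and the service name are placeholders.

```python
triples = set()  # each entry is a (subject, property, object) tuple


def record_invocation(invocation, service, inputs, outputs):
    """Turn one hypothetical 'invocation finished' event into provenance triples."""
    triples.add((invocation, "instanceOf", service))
    for item in inputs:
        triples.add((invocation, "input", item))
    for item in outputs:
        triples.add((invocation, "output", item))
        for source in inputs:
            triples.add((item, "directlyDerivedFrom", source))


def query(subject=None, prop=None, obj=None):
    """Return the sorted triples matching every field that is not None."""
    pattern = (subject, prop, obj)
    return sorted(
        t for t in triples
        if all(want is None or want == got for want, got in zip(pattern, t))
    )


record_invocation(
    "urn:example:invocation1", "ExampleBlastService",
    inputs=["urn:example:query_seq"],
    outputs=["urn:example:report"],
)
derived = query(prop="directlyDerivedFrom")
```

A real triple store would add namespaces and a query language, but the pattern-matching idea is the same.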
30
31
Provenance Tracking
Which Ensembl gene does pathway mmu004620 come from?
32
[Figure: derivation chain over Pathway_id, KEGG_id, Uniprot, Entrez and Ensembl_gene_id, with edges labelled dF]
Workflows over Results
Automatically backtrack through the data provenance graph
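Backtracking through the provenance graph amounts to following derivation links from a result until an item of the wanted kind is reached. The sketch below assumes the dF edges in the figure are derived-from links and that the chain runs from pathway to KEGG ID to Uniprot/Entrez IDs to Ensembl gene. Both the edge direction and all identifiers are assumptions made for illustration.

```python
# Hypothetical derived-from edges: item -> items it was derived from (placeholder IDs).
DERIVED_FROM = {
    "pathway:example": ["kegg:example"],
    "kegg:example": ["uniprot:example", "entrez:example"],
    "uniprot:example": ["ensembl:gene_example"],
    "entrez:example": ["ensembl:gene_example"],
}


def backtrack(item, wanted_prefix, derived_from):
    """Walk derived-from edges from item and collect ancestors with the wanted prefix."""
    found, stack, seen = [], [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another derivation path
        seen.add(current)
        if current.startswith(wanted_prefix):
            found.append(current)
        stack.extend(derived_from.get(current, []))
    return found


source_genes = backtrack("pathway:example", "ensembl:", DERIVED_FROM)
```

The `seen` set keeps a gene reachable by two derivation paths from being reported twice.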
33
A workflow marketplace
34
webTaverna GUI - main
36
[Figure: release process diagram. Labels: Ingest; Pioneers; Early adopters; Conservatives; myGrid Pre-release; myGrid Release; OMII-UK Release; Software Engineering: XP; Software Engineering: Quality & Test; OMII Software Engineering: Quality & Test; Evaluation; Prioritise & Plan; Production; Applications & Professional Services; myGrid Alliance; Sourceforge community]
37
Who are the OMII Users?
Increasing variation in requirements with the scientific domain.
Different scientific/research domains
End Users
Application Developers
Service and Middleware Developers
Middleware Deployers
Different activities
Systems Administrators
38
Taverna is now part of OMII-UK
Taverna 1.5 – Today!
Taverna 1.6
myExperiment
39
Taverna 1.5
Integrated provenance
Raven release mechanism to simplify updates for the user
+/- 300 semantic annotations for core services
Patterns for using proxies for bulk data transactions
Redeveloped plug-in and enactor framework, improved iteration events, data management
42
Taverna 1.5
Integrated provenance
Raven release mechanism to simplify updates for the user
+/- 300 semantic annotations for core services
Add_ncbi_to_string: beanshell script, need to ask Paul for more details
  Input:
  Output:
Kegg_gene_ids_all_species (bconv): converts external IDs to KEGG IDs [mapping]
  string: External ID, e.g. NCBI ID [Genebank_GI]
  return: KEGG gene ID [KEGG_record_id]
Get_pathways_by_genes: Search all pathways which include all the given genes [Searching]
  Input: List of KEGG genes id [KEGG_gene_id]
  Output: Return a list of pathway_id of specified KEGG genes_id
Merge_pathways
  Stringlist
  Concatenated
This workflow takes in Entrez gene ids then adds the string "ncbi-geneid:" to the start of each gene id. These gene ids are then cross-referenced to KEGG gene ids. Each KEGG gene id is then sent to the KEGG pathway database and its relevant pathways returned.
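The annotated workflow described on this slide can be sketched in a few functions. Only the "ncbi-geneid:" prefix comes from the slide. The lookup tables stand in for the KEGG services, all IDs in them are placeholders, the function names are hypothetical, and reading the merge step as plain concatenation is an assumption.

```python
# Placeholder lookup tables standing in for the KEGG services (values are made up).
KEGG_LOOKUP = {"ncbi-geneid:0001": "kegg:g1", "ncbi-geneid:0002": "kegg:g2"}
PATHWAY_LOOKUP = {"kegg:g1": ["path:p1", "path:p2"], "kegg:g2": ["path:p3"]}


def add_ncbi_to_string(entrez_ids):
    """Prefix each Entrez gene id with the string given on the slide."""
    return ["ncbi-geneid:" + gene_id for gene_id in entrez_ids]


def to_kegg_gene_ids(external_ids):
    """Cross-reference external ids to KEGG gene ids, skipping unknown ones."""
    return [KEGG_LOOKUP[ext] for ext in external_ids if ext in KEGG_LOOKUP]


def get_pathways_by_gene(kegg_id):
    """Return the pathway ids recorded for one KEGG gene id."""
    return PATHWAY_LOOKUP.get(kegg_id, [])


def merge_pathways(pathway_lists):
    """Concatenate the per-gene pathway lists into one flat list."""
    return [pathway for pathways in pathway_lists for pathway in pathways]


prefixed = add_ncbi_to_string(["0001", "0002", "0003"])
kegg_ids = to_kegg_gene_ids(prefixed)
merged = merge_pathways([get_pathways_by_gene(k) for k in kegg_ids])
```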
44
Taverna 1.6
Due out Summer 2007
Revised enactment core
Native support for long running workflows
Data proxy to deal with bulk data transactions
Improved service discovery and provenance management
46
Obtaining Taverna
Taverna is available under the LGPL from our project site on Sourceforge.net
http://taverna.sourceforge.net
Win32, Solaris / Linux & OS-X
Includes online and downloadable user manual, examples etc.
Support via project mailing lists
47
Conclusions
See plans for Taverna 2.0 on myGrid wiki
Taverna development is user-driven
Please keep in touch and tell us what you would like to see via the myGrid mailing lists: Taverna Users, Taverna Hackers
Taverna http://taverna.sourceforge.net
myGrid http://www.mygrid.org.uk
OMII-UK http://www.omii.ac.uk
48
Acknowledgements
Phase1 myGrid researchers, Phase2 OMII-UK, myGrid Research Team
Peter Li, Paul Fisher, Andy Brass, Robert Stevens, Mark Wilkinson
EPSRC, Wellcome Foundation, EU