INFSO-RI-508833 Enabling Grids for E-sciencE High Throughput Bioinformatics analysis on the Grid...

Post on 30-Jan-2016

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

High Throughput Bioinformatics analysis on the Grid

EMBnet/CNB http://www.es.embnet.org/

Scientific Workshop, AGM'06

Helsinki, Finland

Grid Workshop, SC '05, Seattle WA, USA 2

Summary

HT analysis on the Grid

GROCK architecture

GROCK as Web Service

Lessons Learnt

Thanks

So long and thanks for all the fish!

Why do we want HT?

• The short answer
– To perform many analyses efficiently

• The long answer
– To run multi-process jobs
    Evolutionary bootstraps
    Docking
    Image processing...
– To run many processes
    High number of users
    High number of problems
    • Modelling
    • Function prediction
    • Structure prediction...

GROCK goal

• Why do we want High-Throughput docking?
● find best matches between two molecular structures
● for a probe molecule against all molecules in a database
● drug against protein
    ● Identify drug function, predict secondary effects
● protein against proteins
    ● Identify protein interactions, build interaction networks
● protein against drugs
    ● Identify candidate drugs for therapy
● Beyond a single organism

So, what? Is it any good?

To tell you the truth:

In and of itself

it is of limited interest

Beg to disagree

Pharmaceutical companies have been using something like 3D-QSAR for years

With considerable success

Come on! Do I need to tell you this? Really?

• You should never blindly trust a computer.
– Predictions must be verified
– Predictions must be put in perspective
– Predictions are but a small part of a larger protocol

• It is difficult to get access to pharmacological data
– Unless you are a Pharma

• GROCK should be part of a larger ensemble

GROCK in context

• Predicting protein interaction networks
– HT protein interaction predictions (HT-GROCK? Whoa!)
– Experimental validation
    Proteomic analysis of experimental results
– Systems Biology modelling
– Analyze macromolecular assemblies (e.g. 3D-EM)

• Predicting new drugs
– Build protein models / Analyze protein structure
– Identify putative targets (3D-QSAR, GROCK, WISDOM)
– Screen using QSAR
– Predict possible effects (GROCK, HT-GROCK? Re-Whoa!)
– Experimental validation

Attacking current needs

• GROCK is a tool that makes 3D molecular screening:
● Easy: through a simple, intuitive web interface
● More reliable than pharmacophores: uses 3D docking methods
● Versatile: uses standard software and data
● Efficient: thanks to the Grid (EGEE)
● Integrable in other programs as a Web Service (SOAP or XML-RPC)
● And is GPL!

A Real Time example

• Just for fun: Let's run a screening of aspirin against a small test database
● Connect to GROCK server
● Upload aspirin
● Select options
● Run

GROCK: match explorer

For each pair
● show 10 best
● 3D coords
● PNG
● JPEG
● PS
● PDF
● VRML1
● VRML2
● Jmol

Aspirin (Acetyl salicylic acid)
● Induces its effect through phospholipase A2
● Which is not on the search subset itself (sic)
● But has many other effects
    ● on Protein G signalling
    ● modulates hormone stimulated cyclic AMP production
    ● protects against neurotoxicity
    ● is used in dyslipidaemias
    ● affects pulmonary surfactant
    ● etc... (check PubMed).

Caveats

● Molecular databases are noisy
– Plenty of room for enhancement
– ...by Biology/Chemistry Structuralists
● Meaningless molecules are included
– E.g. irrelevant molecules from uninteresting organisms
– Data reduction by representative clustering
● Meaningful molecules may be excluded
– E.g. by substitution of a relevant protein by an irrelevant relative
● 3D matching is approximate
– E.g. meaningful info not included (like water or ion molecules)
● Users MUST exercise thoughtful criticism
– Just like with any other theoretical tool
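
The "data reduction by representative clustering" caveat can be illustrated with a small greedy sketch. The slides do not say which clustering method is used for the databases, so the similarity function, the threshold and the toy sequences below are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def pick_representatives(items: Sequence[T],
                         similarity: Callable[[T, T], float],
                         threshold: float) -> List[T]:
    """Greedy reduction: keep an item only if it is not too similar
    to any representative that has already been kept."""
    representatives: List[T] = []
    for item in items:
        if all(similarity(item, rep) < threshold for rep in representatives):
            representatives.append(item)
    return representatives


def shared_fraction(a: str, b: str) -> float:
    """Toy similarity (hypothetical metric): fraction of identical positions."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


# Made-up sequences standing in for database entries.
sequences = ["MKTAYIAK", "MKTAYIAR", "GGSSGGSS", "MKTAYLAK"]
reduced = pick_representatives(sequences, shared_fraction, threshold=0.8)
print(reduced)
```

Note that the result depends on input order: whichever close relative comes first is kept and the others are dropped, which is exactly how a relevant molecule can end up replaced by an irrelevant relative.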

Next: Architecture

✔ GROCK: HT docking on the Grid

GROCK architecture

GROCK as Web Service

Lessons learnt

Thanks

So long and thanks for all the fish!

GROCK: architecture

• Design:
– User
– Web Server
– Web service
– Grid front-end
– Grid back-end

• Advantages:
– Secure
– Fail safe
– Efficient
– GENERIC

• To be done:
– Make restartable

Avoiding “Death eaters”

GROCK design

• Command line application
• WS wrapper
• WWW interface
• Provision for easy expansion
– Plugin mechanism to add new databases (PDB, HIC-UP, ZINC)
– Plugin mechanism to add new methods (GRAMM, 3D-DOCK)
– Well defined plugin interfaces (roll your own)
• GROCK builds on other tools
– Result browser relies on remote WS for generating output
– Generic docking methods
• GROCK may be used to build other tools
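
One way to picture the plugin mechanism for databases and methods is a pair of registries keyed by name. This is only a sketch under assumptions: the registry names, the decorators, the toy database and the toy scoring function are hypothetical and are not GROCK's actual plugin interface.

```python
from typing import Callable, Dict, Iterable

# Hypothetical registries: plugin name -> implementation.
DATABASES: Dict[str, Callable[[], Iterable[str]]] = {}
METHODS: Dict[str, Callable[[str, str], float]] = {}


def database(name: str):
    """Decorator that registers a function listing the targets of a database."""
    def register(func):
        DATABASES[name] = func
        return func
    return register


def method(name: str):
    """Decorator that registers a docking method: (probe, target) -> score."""
    def register(func):
        METHODS[name] = func
        return func
    return register


@database("toy-db")
def toy_db() -> Iterable[str]:
    # A real plugin would enumerate PDB, HIC-UP or ZINC entries.
    return ["target-A", "target-BB", "target-CCC"]


@method("toy-dock")
def toy_dock(probe: str, target: str) -> float:
    # Placeholder score; a real plugin would invoke GRAMM or 3D-DOCK.
    return float(abs(len(probe) - len(target)))


def screen(probe: str, db_name: str, method_name: str) -> Dict[str, float]:
    """Run one probe against every target of the selected database."""
    dock = METHODS[method_name]
    return {target: dock(probe, target) for target in DATABASES[db_name]()}


scores = screen("aspirin", "toy-db", "toy-dock")
print(scores)
```

With this shape, adding a database or a docking method means writing one function and registering it; the screening code itself does not change.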

Next: WS

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

GROCK as Web Service

Lessons learnt

Thanks

So long and thanks for all the fish!

GROCK as a Web Service

– Callable using SOAP or XML-RPC
– Provides its own description and WSDL when invoked with no parameters
    User-friendly, human readable
– Provides meta-data about itself
    Source code
    Usage info
    Bibliography
– Job monitoring
    Asynchronous Web Service
    Dynamic
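
The "describes itself when invoked with no parameters" behaviour can be sketched with Python's standard XML-RPC modules. The method name `grock`, its parameters and the description fields are assumptions for illustration; the real service's interface is not given in the slides. The dispatcher is driven in-process here, so no network server is started.

```python
import xmlrpc.client
from xmlrpc.server import SimpleXMLRPCDispatcher

# Hypothetical self-description returned when no parameters are given.
DESCRIPTION = {
    "name": "grock-sketch",
    "usage": "grock(probe, database, method)",
    "source": "(location of the source code would go here)",
    "bibliography": "(references would go here)",
}


def grock(*params):
    """With no parameters, describe the service; otherwise accept a job."""
    if not params:
        return DESCRIPTION
    probe, database, method = params
    return {"status": "accepted", "probe": probe,
            "database": database, "method": method}


dispatcher = SimpleXMLRPCDispatcher(allow_none=True)
dispatcher.register_function(grock, "grock")

# Build an XML-RPC request with an empty parameter list, dispatch it
# in-process, and decode the XML-RPC response.
request = xmlrpc.client.dumps((), methodname="grock")
response = dispatcher._marshaled_dispatch(request.encode("utf-8"))
(result,), _ = xmlrpc.client.loads(response)
print(result["usage"])
```

A client that knows nothing about the service can therefore call it with no arguments first and read the usage information from the reply.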

An asynchronous WS

When invoked, GROCK returns an opaque key that may be used to query it for status and output info:

Keys are generated at random with enough entropy to make them difficult to guess

The key is actually a ‘session ID’ that uniquely identifies a given job request in the file store.

GROCK uses the key to retrieve job status and output
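
A minimal sketch of the opaque-key idea, assuming a directory per job in a local file store. The store location, the file names and the status values are hypothetical; only the general scheme (random key in, status out) comes from the slide.

```python
import json
import re
import secrets
import tempfile
from pathlib import Path

# Hypothetical file store: one directory per job, named after its key.
STORE = Path(tempfile.mkdtemp(prefix="grock-sketch-"))

# Keys from secrets.token_urlsafe use only these characters.
KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def submit(request: dict) -> str:
    """Store a job request and return the opaque key that identifies it."""
    key = secrets.token_urlsafe(32)  # 32 random bytes, hard to guess
    job_dir = STORE / key
    job_dir.mkdir()
    (job_dir / "request.json").write_text(json.dumps(request))
    (job_dir / "status").write_text("queued")
    return key


def status(key: str) -> str:
    """Look up the status of a job from its key."""
    # Reject anything that could not be a generated key (e.g. "../x").
    if not KEY_PATTERN.fullmatch(key):
        return "unknown"
    job_dir = STORE / key
    if not job_dir.is_dir():
        return "unknown"
    return (job_dir / "status").read_text()


key = submit({"probe": "aspirin", "database": "test"})
print(status(key))
print(status("not-a-real-key"))
```

Because the key doubles as the session ID, the client needs no account or login: holding the key is what grants access to the job's status and output.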

Next: Lessons Learnt

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

✔ GROCK as Web Service

Lessons learnt

Thanks

So long and thanks for all the fish!

Future directions

• Add support for additional docking methods
– DOCK5 (MPI), AutoDock, others

• Add support for other databases
– HIC-Up
– ZINC subsets

• Exploit Grid distributed storage system
– Needed for truly massive jobs (e.g. drug screening)

• Apply architecture to other problems (evolution, 3D reconstruction, high-throughput *)

Next steps

• Extend pharmainformatics work
– Molecular modelling (YaMI: MODELLER)
    Already under way
– Molecular Dynamics (AMBER, TINKER, NAMD)
    In collaboration with Raul Isea (RIB), Paulino Gomez-Puertas (CBM)...
– Cheminformatics (MPQC, NWChem, Car-Parrinello, DFT)
    If still needed

• Extend interactions work
– 3D-EM analysis of macromolecular assemblies (analysis restarted in February 2006)
– Xmipp (in-house open source package)
– In collaboration with 3D-EM NoE
– Start easy, with the heaviest and most used applications

Lessons learned

• YaMI v7 (Yet another Modeller Interface)

• GridGRAMM
– Running a single process takes longer
– But may be worth the wait
– Don't let anybody mislead you: The Grid is a source of raw computing power. Period.

• HT Docking
– All you need is a tight loop, et voilà!
– Really!
– However...

Component Based Architecture

• Extending GROCK to use additional dockers
• Extending GROCK to use distributed storage
• Extending GROCK to run in non-EGEE environments
• Shows the relevance of choosing appropriate interfaces
• GROCK, YaMI, GridGRAMM themselves require NEW, well thought out interfaces
• Job execution DOES NOT
– DRMAA, from the DRMAA-WG, is a standard batch submission API
– Joined DRMAA-WG in February 2006
– Goal: Define a DRMAA binding for PHP
– Build a DRMAA binding for EGEE
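
The point about job execution sitting behind a standard interface can be sketched as follows. The class and method names are hypothetical and only loosely inspired by DRMAA's submit-and-wait model; this is not the DRMAA specification or any existing binding. A local subprocess backend stands in for a real batch system or Grid.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from typing import Dict, List


class JobSession(ABC):
    """Hypothetical, DRMAA-inspired interface (not the DRMAA API itself)."""

    @abstractmethod
    def run_job(self, command: List[str]) -> str:
        """Submit a job and return its identifier."""

    @abstractmethod
    def wait(self, job_id: str) -> int:
        """Block until the job finishes and return its exit code."""


class LocalSession(JobSession):
    """Stand-in backend that runs jobs as local subprocesses."""

    def __init__(self) -> None:
        self._jobs: Dict[str, subprocess.Popen] = {}

    def run_job(self, command: List[str]) -> str:
        proc = subprocess.Popen(command)
        job_id = str(proc.pid)
        self._jobs[job_id] = proc
        return job_id

    def wait(self, job_id: str) -> int:
        return self._jobs.pop(job_id).wait()


def run_all(session: JobSession, commands: List[List[str]]) -> List[int]:
    """Application code: depends only on the interface, not on the backend."""
    job_ids = [session.run_job(command) for command in commands]
    return [session.wait(job_id) for job_id in job_ids]


commands = [
    [sys.executable, "-c", "raise SystemExit(0)"],
    [sys.executable, "-c", "raise SystemExit(3)"],
]
exit_codes = run_all(LocalSession(), commands)
print(exit_codes)
```

Swapping `LocalSession` for a backend that talks to SGE, Condor or EGEE would leave `run_all` untouched, which is the portability argument the slide makes.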

Our Advice

• Program using a standard API: DRMAA
– Do it once, run on SGE, Condor, GridWay, etc...

• Use third party work whenever possible
– To save effort and increase portability
– Remember: Don't overdo it! KISS!

• Define plugin interfaces (and document them)
– For extensibility

• Define WS invocation interface (and document it)
– For integration into other frameworks

• And finally program a trivial loop (always document)
– Don't be too worried about performance
– It will be simple, fast and short
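
The "trivial loop" at the heart of high-throughput screening might look like the sketch below: one docking run per database entry, then a ranking. A thread pool stands in for the Grid, and the toy scoring function, the target names and the "lower score is better" convention are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def screen(probe: str,
           targets: List[str],
           dock: Callable[[str, str], float],
           workers: int = 4) -> List[Tuple[str, float]]:
    """Dock one probe against every target and rank the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The whole "high-throughput" part: one independent job per target.
        scores = list(pool.map(lambda target: dock(probe, target), targets))
    # Assumption: lower scores are better matches.
    return sorted(zip(targets, scores), key=lambda pair: pair[1])


def toy_dock(probe: str, target: str) -> float:
    # Placeholder for a real docking program.
    return float(len(target))


ranked = screen("aspirin", ["1abc", "2a", "3ab"], toy_dock)
best = ranked[:10]  # keep at most the ten best matches
print(best)
```

Each iteration is independent of the others, so the loop needs no coordination between jobs; that is why it stays short and why performance tuning of the loop itself matters little.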

Current work and next steps

• Build DRMAA API for EGEE
– So that next steps are easier

• Think about best architecture for data distribution
– So it is intuitive, effective and simple

• Go ahead
– Molecular modelling
– Molecular dynamics
– Molecular reconstruction
– Macromolecular assembly analysis by 3D-EM
– Cheminformatics (if not done yet)

Next: Middleware classes

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

✔ GROCK as Web Service

✔ Lessons learnt

Thanks

So long and thanks for all the fish!

We wish to thank

• YOU ALL
– for being here, your help, encouragement, feedback and support
– and not falling asleep

• The TEAM at CNB
– Biocomputing
    José M. Carazo, Carlos Pérez-Roca, Enrique de Andrés, Natalia Jiménez, Sjors Schëres, Alfredo
– Bioinformatics
    José R. Valverde, David J. García

• THE EU for EGEE

Next: That's all folks!

✔ GROCK: HT docking on the Grid

✔ GROCK architecture

✔ GROCK as Web Service

✔ PHP middleware

✔ LCG middleware

✔ Thanks

So long and thanks for all the fish!

Any questions?