2010,  Boston

Community-driven computational biology with Debian and Taverna

Steffen Möller, Hajo Krabbenhöft (Lübeck)

Alan Williams, Katy Wolstencroft, Carole Goble (Manchester)

Andreas Tille, Charles Plessy, David Paleino (Debian)

BOSC 2010, Boston

Motivation

● Open Source Bioinformatics continues to grow and improve
  ● steadily increasing number of tools and databases
  ● addressing more and more complex issues
● Bioinformatics has found its way into wet-lab routine
  ● strong service units with many diverse projects
  ● single deeply embedded individuals
● Wanted:
  ● Exchange of bioinformatics recipes, as a database or eventually linked from papers' method sections
  ● Reliable, instantly available, powerful external resources to perform analysis

Dual role of Cloud technologies

● Sharing of physical resources
  ● Computation
  ● Storage
● Sharing of management resources
  ● Reference images
  ● Pre-downloaded, pre-indexed data
    – Amazon public data sets
    – “whatever BOSC 2010 agrees on” for our Eucalyptus playground

How to Co-Maintain Cloud Images

● Cloud images can be maintained just like regular machines
● The installation of many tools by many people
  ● works, you get somewhere, but then you don't want to touch it again
  ● is error-prone because of inter-dependencies of packages (shared files, version incompatibilities)
● The partial update of such co-maintained images
  ● will most likely break something somewhere → modularity
  ● you want to know what has been done to an image without a dependency on external web pages → introspection

How to Co-Maintain Cloud Images

Wanted:
● Mechanism to allow the individual upgrading of software tools and integrity checks
● Sharing of the effort
  – to compile the source code: one wants to install the binaries only whenever possible
  – to describe the packages: should be of little overhead or be already available

This is basically what Linux distributions do.
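The introspection and per-tool upgrading asked for here can be illustrated with a small Python sketch that compares the package lists of two image states. The manifest format (whitespace-separated name and version, one package per line, as `dpkg-query -W` prints) is an assumption, and the function names and version strings are hypothetical.

```python
def parse_manifest(text):
    """Parse lines of 'package version' into a dict (assumed format)."""
    packages = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


def diff_manifests(old, new):
    """Report packages added, removed, or changed between two image states."""
    added = {p: new[p] for p in new if p not in old}
    removed = {p: old[p] for p in old if p not in new}
    changed = {p: (old[p], new[p])
               for p in old if p in new and old[p] != new[p]}
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical package lists; the version strings are made up.
    before = parse_manifest("emboss 6.2.0-1\nt-coffee 8.14-1\n")
    after = parse_manifest(
        "emboss 6.3.1-1\nt-coffee 8.14-1\nclustalw 2.0.12-1\n")
    print(diff_manifests(before, after))
```

Because the package database lives inside the image, such a comparison needs no external web page to tell what was done to it.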

Dual role of Debian

● Package provider
  ● many tens of thousands of packages are offered
    – directly as a Linux distribution
    – indirectly via descendants Ubuntu or BioLinux
  ● technical excellence
    – coherent builds across many platforms (PowerPC, Intel 32 and 64 bit, AMD, MIPS) and kernels (Linux, HURD, BSD, OpenSolaris)
    – separation of documentation from binaries, GUI from command line, ...
● Community
  ● bug reports
  ● mailing lists, special interest groups, you may discuss
    – packages that are missing
    – problems that many of us have that are yet unsolved

bioinformatics blend

● subversion and git repositories for packages
● friendly and open community
● keen on close links with upstream
● Series of tasks within Debian Med – not only bioinformatics:
  ● Biology - Debian Med micro-biology packages
  ● Biology development - Debian Med packages for development of micro-biology applications
  ● Content management - Debian Med content management systems
  ● Medical data - Debian Med suggestions for medical databases
  ● Dental - Debian Med packages related to dental practice
  ● Epidemiology - Debian Med epidemiology related packages
  ● Hospital information systems - Debian Med suggestions for Hospital Information Systems
  ● Imaging - Cross-platform for visualizing, processing and analysing of bioimages
  ● Imaging development - Debian Med packages for medical image development
  ● Laboratory - Debian Med suggestions for medical laboratories
  ● Pharmacy - Debian Med packages for pharmaceutical research
  ● Physics - Debian Med packages for medical physicists
  ● Practice - Debian Med packages for practice management
  ● Psychology - Debian Med packages for psychology
  ● Statistics - Debian Med statistics
  ● Tools - Debian Med several tools
  ● Typesetting - Debian Med support for typesetting and publishing

How to Co-Maintain a Debian Package

● Technically
  ● Do not touch the original source tree
  ● Create folder “debian” with files
    – “control” - description of package + build deps
    – “changelog” - version of package and what's new
    – “rules” - how to say “make” and “make install”
    – “install” - to split documentation from the rest
    This should not be more difficult than executing “make all” directly. Contact me or the list when running into problems.
  ● FTP-upload of package to distribution's server
  ● Sharing of “debian” folder with community with subversion/git/bazaar
● Community-driven security
  ● Web of trust: the creator of a package signs it with his GPG key prior to upload, and that GPG key is signed by others
  ● Bug reports may block transition of package to “stable” release
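To give an impression of how little the “control” file needs, the following Python sketch renders a minimal one. It is an illustration under assumptions, not a packaging tool: the package name, maintainer and description are hypothetical, the section “science” is just a plausible choice, and real packages carry further fields (for example Standards-Version).

```python
def render_control(source, maintainer, build_depends, description):
    """Return the text of a minimal debian/control file (illustrative only)."""
    short, _, long_text = description.partition("\n")
    lines = [
        f"Source: {source}",
        "Section: science",
        "Priority: optional",
        f"Maintainer: {maintainer}",
        f"Build-Depends: {', '.join(build_depends)}",
        "",
        f"Package: {source}",
        "Architecture: any",
        "Depends: ${shlibs:Depends}, ${misc:Depends}",
        f"Description: {short}",
    ]
    # Long-description lines are indented by one space; an empty line
    # is written as a single dot.
    lines += [f" {line}" if line.strip() else " ."
              for line in long_text.splitlines()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # All values below are hypothetical.
    print(render_control(
        source="mytool",
        maintainer="Jane Doe <jane@example.org>",
        build_depends=["debhelper (>= 7)"],
        description="hypothetical sequence analysis tool\n"
                    "Longer text describing what the tool does.",
    ))
```

The upper stanza describes the source package and its build dependencies, the lower one the binary package that users install.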

Something's missing

● We now have the resources.
  ● packages that auto-transform into Cloud images
  ● machines and disk to compute and store in-/output
● We have quite some Bio* community
● Wanted:
  ● Linking of cloud resources with the desktop
  ● Linking of web resources into it
  ● Exchange and reference of
    – Inter-package
    – Inter-resource
    processes that (have) work(ed for someone) and may be adapted

Dual role of Taverna

● Technology:
  ● Connects files, web services and applications to workflows
  ● Workflows may comprise other workflows
● Community:
  ● Portal to complete and partial solutions as workflows on myExperiment.org

Taverna integrates command line

● Any command executed in the shell can be integrated
  ● local execution, remote execution with ssh or grid
  ● nicely links clouds, packages and web
● Introduction of UseCases as workflow elements
  ● Database with XML-specification of
    – Inputs, Outputs and their MIME types
    – Command line and tools it needs
  ● Purpose-specific wrappers around binaries or scripts

Krabbenhöft et al., Bioinformatics, 2008
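What such a UseCase description has to capture can be sketched in Python as follows. This is only an illustration of the listed ingredients (inputs, outputs, MIME types, command line, required tools). The class names, the XML element and attribute names, and the example command are made up and do not reproduce the actual UseCase schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Port:
    name: str
    mime_type: str


@dataclass
class UseCase:
    name: str
    command: str                                  # command line to execute
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)
    tools: list = field(default_factory=list)     # packages the command needs

    def to_xml(self):
        """Serialise to XML; element and attribute names are hypothetical."""
        root = ET.Element("usecase", name=self.name)
        ET.SubElement(root, "command").text = self.command
        for tag, ports in (("input", self.inputs), ("output", self.outputs)):
            for port in ports:
                ET.SubElement(root, tag, name=port.name, mime=port.mime_type)
        for tool in self.tools:
            ET.SubElement(root, "requires", package=tool)
        return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "align_tool" is a placeholder command, not a real program.
    uc = UseCase(
        name="align",
        command="align_tool sequences.fasta",
        inputs=[Port("sequences", "text/plain")],
        outputs=[Port("alignment", "text/plain")],
        tools=["align-tool"],
    )
    print(uc.to_xml())
```

Keeping the required tools in the description is what lets a workflow element be tied to an installable package.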

Shared UseCase management

Example: Clustering many sequences

● Compute times of several hours are generally not acceptable for public web services

● Not a problem with integrated clouds

[Figure: workflow diagram spanning “Local” and “Cloud”, with the steps: Cloud image selection, apt-get install t-coffee, Start instance, Inform Taverna about IP number, Workflow execution, Results interpretation]
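The preparation step of this example, installing t-coffee on a started instance that is then addressed by its IP number, could look roughly like the Python sketch below. It only builds the ssh command lines and does not run them. The user name, the helper names and the address are assumptions for illustration.

```python
import shlex


def ssh_command(host, remote_argv, user="root"):
    """Build the local argv that runs remote_argv on host via ssh."""
    remote = " ".join(shlex.quote(arg) for arg in remote_argv)
    return ["ssh", f"{user}@{host}", remote]


def prepare_instance(host, packages):
    """Return the commands that install packages on a started instance."""
    return [
        ssh_command(host, ["apt-get", "update"]),
        ssh_command(host, ["apt-get", "install", "-y", *packages]),
    ]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address standing in for the
    # IP number reported by the started instance.
    for argv in prepare_instance("192.0.2.10", ["t-coffee"]):
        print(" ".join(shlex.quote(a) for a in argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` once a real instance is reachable.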

Remaining challenge: sharing public data

● Could work like the management of software, but
  ● Often large with frequent updates
    – users differ in their demands for latest versions
  ● Involves post-processing
    – users differ in their demand for it
● Clouds could help, but
  ● one would not want to pay for everything all the time
  ● the installation process would need to be transparent to locally recreate or update or … improve the data

Proposal: getData, a shared Perl script

● The script is a large hash table
  ● extendable by configuration files that may be contributed from various packages, like EMBOSS
  ● Every entry comprises another hash table with attributes
    – Name: full name of database
    – Source: how to retrieve it
    – Post-download: what to do once it has arrived
    – Recommends: tools suggested to install with the data
● All very simple and extendable
  ● Direct mirroring of effort performed on the command line
  ● The community can co-maintain this script more easily than some cloud instance
  ● More on http://wiki.debian.org/getData
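The nested hash table can be pictured with the following sketch. It is written in Python rather than Perl and is not taken from the getData script itself. The entry, its URL and the helper functions are hypothetical; only the four attribute names follow the proposal.

```python
# Registry of databases; each entry is itself a mapping of attributes.
DATABASES = {
    "exampledb": {  # hypothetical entry, not from the real getData script
        "name": "Example sequence database",
        "source": "wget https://example.org/exampledb.fasta.gz",
        "post-download": "gunzip exampledb.fasta.gz",
        "recommends": ["emboss"],
    },
}


def register(registry, contributed):
    """Merge entries contributed by a package into the registry."""
    for key, entry in contributed.items():
        missing = {"name", "source"} - entry.keys()
        if missing:
            raise ValueError(f"{key}: missing attributes {sorted(missing)}")
        registry[key] = entry
    return registry


def plan(registry, key):
    """List the shell commands to fetch and post-process one database."""
    entry = registry[key]
    steps = [entry["source"]]
    if entry.get("post-download"):
        steps.append(entry["post-download"])
    return steps


if __name__ == "__main__":
    for step in plan(DATABASES, "exampledb"):
        print(step)
```

Storing the retrieval and post-processing steps as plain shell commands mirrors what a user would otherwise type by hand, which keeps contributed entries easy to review.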

Summary

● Debian as community and repository for bioinformatics software
  ● Mailing lists, source code management
  ● FTP servers
● Clouds introduce dynamics into the collaboration
  ● Data flow between packages
  ● Usability
  ● Shared maintenance of public data
● Taverna
  ● Connects web, grid, cloud instances and local machine
  ● Fosters exchange of experiences with various workflows

References and Acknowledgements

[1] Debian-Med http://debian-med.alioth.debian.org

[2] getData http://wiki.debian.org/getData

[3] Eucalyptus http://www.eucalyptus.com

[4] Taverna http://www.taverna.org.uk

[5] Taverna UseCases http://taverna.nordugrid.org

[6] myExperiment http://www.myExperiment.org

The development of the UseCases plugin to Taverna was funded by the “KnowARC” EU project.

Debian/Ubuntu contributes

● Impressive number of packages
  ● Bioinformatics (Bio*, EMBOSS, clustering, ...)
  ● Cheminformatics (autodock, gromacs, ballview, …)
  ● General scientific computing tools and libraries
    – Clustering (Torque, Sun Grid Engine, ...)
    – Eucalyptus Cloud environment
● Automation of database updates and indexing with the “getData” script

Concept: Distro+Workflows+Cloud

● Debian/Ubuntu Linux Distribution
  ● Chem- + Bioinformatics packages
  ● Friendly Community
● Taverna Workflow Suite
  ● Access to services in the web
  ● Access to command line tools via ssh or grids
  ● Exchange of ideas via myExperiment.org
● Eucalyptus or Amazon Clouds
  ● Sharing of databases and indices
  ● Readily available or customized images to instantiate

The Cloud contributes

A platform for individuals to share
● Data (“download only once”)
● Its management (“update and index only once”)
● Experiences (“I show you”)

Physical resources
● To be shared in community (“common cluster”)
● To be bought on demand (“run at Amazon.com”)

Solutions
● Readily usable images – by community or industry
● Adaptability to local demands