Bosc2011 ntino-krampis-full

Cloud BioLinux: open source, fully-customizable bioinformatics computing on the cloud for the genomics community and beyond BOSC 2011 - Vienna, Austria Ntino Krampis, PhD Asst. Professor J. Craig Venter Institute (JCVI) [email protected]


Page 1: Bosc2011 ntino-krampis-full

Cloud BioLinux: open source, fully-customizable bioinformatics computing on the cloud for the genomics community and beyond

BOSC 2011 - Vienna, Austria

Ntino Krampis, PhD
Asst. Professor
J. Craig Venter Institute (JCVI)
[email protected]

Page 2: Bosc2011 ntino-krampis-full

The community is what makes an open source project

Brad Chapman, Tim Booth, Mesude Bicak, Dawn Field, Dan Pass – core development and planning

Enis Afgan, Pjotr Prins, Stephen Möller - and all other members of the Cloud BioLinux community who move it forward

J. Craig Venter Inst. - time allowed to work on an open-source project

Page 3: Bosc2011 ntino-krampis-full

Expensive sequencing and large organizations

● large sequencing centers, multi-million, broad-impact sequencing projects

● dedicated bioinformatics department, compute clusters

Commodity sequencing and small labs

● small form-factor, bench-top sequencer available: GS Junior by 454

● sequencing as a standard technique in basic biology and genetics research

● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome

Page 4: Bosc2011 ntino-krampis-full

Will small labs become the long tail of sequencing?

[Figure: long-tail plot with axes labeled "amount of sequencing" and "number of labs". Credit: Wikimedia Commons]

Page 5: Bosc2011 ntino-krampis-full

“Bioinformatics nation is a land of city-states” Lincoln Stein

● small labs building small-scale bioinformatics infrastructures

● duplication of effort in compiling and installing software tools

● some groups have no hardware, expertise, or time to install and run software

● NEBC BioLinux ( tinyurl.com/BioLinux-NEBC ) 100+ pre-configured tools

● example: glimmer, hmmer, phylip, rasmol, genespring, clustalw, EMBOSS

What about large-scale sequence datasets?

Page 6: Bosc2011 ntino-krampis-full

Cloud BioLinux

pre-configured and on-demand bioinformatics computing on the cloud

cloudbiolinux.org


● JCVI cloud computing research

● NEBC BioLinux software repository

● community effort – Hackathon / BOSC 2010 - 11

● Virtual Machine (VM) on Amazon cloud

● large-scale computing independently of institutional or geographic boundaries

● only need a desktop computer with internet access

Page 7: Bosc2011 ntino-krampis-full

http://tinyurl.com/cloud-biolinux-tutorial

sign up at aws.amazon.com

simple for end-users

Page 8: Bosc2011 ntino-krampis-full

Amazon EC2

Linux desktop via remote desktop client

Page 9: Bosc2011 ntino-krampis-full

What if I want to share my alignments with a collaborator?

save your data as a new VM

$0.10 / GB / month

at 15 GB, it costs $1.50 / month
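
The arithmetic on this slide can be checked with a short sketch. It uses only the per-GB price quoted above (a 2011 figure that may differ today); the function name `monthly_storage_cost` is made up for illustration and is not part of any Amazon or Cloud BioLinux tool.

```python
from decimal import Decimal

# Price quoted on the slide: 0.10 USD per GB per month (2011 figure).
PRICE_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(size_gb, price_per_gb=PRICE_PER_GB_MONTH):
    """Return the monthly cost in USD of keeping a saved VM of size_gb gigabytes."""
    # Decimal avoids binary floating-point rounding in currency arithmetic.
    return Decimal(str(size_gb)) * price_per_gb


if __name__ == "__main__":
    # The 15 GB example from the slide.
    print(monthly_storage_cost(15))
```

Under the same assumed price, a 100 GB saved VM would come to `monthly_storage_cost(100)`, i.e. 10 USD per month.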

Page 10: Bosc2011 ntino-krampis-full

“whole system snapshot exchange” (Dudley and Butte 2010)

capture the state of the computing system and data

software execution parameters and “massaged” input datasets

Page 11: Bosc2011 ntino-krampis-full

● customize Cloud BioLinux based on community requirements

● mix and match software from NEBC or other (DebianMed, Scientific Linux etc.)

● share customized VMs with collaborators, avoiding effort duplication

● deploy Cloud BioLinux on private and local clouds

Cloud BioLinux developer's framework

create cloud VM / images with standardized software configurations

Page 12: Bosc2011 ntino-krampis-full

software domains in bioinformatics: next-gen sequencing, de novo assembly, annotation, phylogeny, molecular structures, gene expression analysis

github.com/chapmanb/cloudbiolinux

Page 13: Bosc2011 ntino-krampis-full

● based on python-fabric auto-deployment tool

● software components listed in plain text files

● collaborators use files to share descriptions of cloud VM / images

● start with a bare-bones VM / image

● fabric downloads and installs specified software

tinyurl.com/python-fabric

Cloud BioLinux developer's framework
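
The idea of listing software in plain text files and letting an automation tool install it can be sketched as below. This is an illustration only, not the actual Cloud BioLinux code: the file format (one package name per line, `#` for comments), the function names, and the use of `apt-get` are all assumptions, and the package names are borrowed from the tool examples earlier in the talk rather than checked against any repository. By default the sketch only prints the commands; in a Fabric 1.x script, something like the `sudo` function from `fabric.api` could be passed as `run` to execute them on a bare-bones VM.

```python
import shlex

# Hypothetical plain-text package list, as a collaborator might share it.
EXAMPLE_LIST = """\
# sequence analysis tools
hmmer
clustalw

phylip   # phylogeny
hmmer
"""


def parse_package_list(text):
    """Return package names in order, skipping blank lines, '#' comments and duplicates."""
    packages = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name and name not in packages:
            packages.append(name)
    return packages


def install_commands(packages):
    """Build one shell install command per package (apt-get is an assumption)."""
    return ["apt-get install -y " + shlex.quote(p) for p in packages]


def deploy(text, run=print):
    """Hand each install command to `run`; the default just prints it."""
    commands = install_commands(parse_package_list(text))
    for command in commands:
        run(command)
    return commands


if __name__ == "__main__":
    deploy(EXAMPLE_LIST)
```

Keeping the list as plain text means two groups can compare or merge their VM descriptions with ordinary diff tools, which is the sharing model the bullets above describe.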

Page 14: Bosc2011 ntino-krampis-full

Cloud BioLinux

The future

● groups.google.com/cloudbiolinux and cloudbiolinux.org

● expand community, receive feedback, add more software to the VM

● scalable computing: SGE (Galaxy Cloudman), Hadoop (cloudgene.uibk.ac.at)

● add next-gen sequencing pipelines; NIH funding adds development effort

● We just had a 2-day codefest at the MetaLab, http://metalab.at/

Page 15: Bosc2011 ntino-krampis-full

And before I finish this talk...

Page 20: Bosc2011 ntino-krampis-full

Thank you!