
Cloud BioLinux: pre-configured and on-demand computing for genomics independently of institutional, geographic or economic boundaries

Ntino Krampis, PhD

JCVI-NIAID workshop 2011, S. Africa

Expensive sequencing and large organizations

● large sequencing center, multi-million, broad-impact sequencing projects

● dedicated bioinformatics department, coordination with other centers

Commodity sequencing and small labs

● small form-factor, bench-top sequencers available: GS Junior by 454

● sequencing as a standard technique in basic biology and genetics research

● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome

Acquiring the sequence data is only the first step

● downstream bioinformatics analysis for scientific discovery

● many commonly-used bioinformatics tools are difficult to install

● usually available only as source code, so installing them needs technical expertise

● large-scale sequence data analysis requires expensive, high-performance computing hardware

Alternative: computational capacity on the cloud

● Cloud Computing: large-scale, high-performance computers accessible through the Internet

● Example: with Gmail, Google Docs, Yahoo! Mail, Facebook etc. you store and access data on a remote computer

● Cloud Computing services such as Amazon EC2 (http://aws.amazon.com/ec2) rent out high computational and data storage capacity on remote computers

How does Cloud Computing work?

operating system, bioinformatics software and data are installed in a Virtual Machine (VM)

a VM is uploaded and executed on a cloud computing service

run a practically unlimited number of VMs for large-scale sequence data analysis

access the VM from a desktop computer through the Internet

[Figure: local desktop computers connect through the Internet to VMs running on the remote Amazon EC2 Cloud Computing service]

Cloud BioLinux

● Cloud BioLinux leverages VM technology and the cloud, offering pre-configured bioinformatics computing

● allows setting up a high-performance data analysis environment without any technical expertise

● researchers can perform large-scale data analysis simply by using a desktop computer with Internet access

● accessible without any institutional, economic or national boundaries

Launching Cloud BioLinux

1. sign up for an Amazon EC2 cloud account:

http://aws.amazon.com/ec2

You can also link an existing account from the main Amazon.com website to pay the cloud usage charges.

We have an account ready for you:
Username: [email protected]
Password: Nhg4|CL0ud!

2. using the account credentials, sign in to the EC2 cloud console (select EC2 in the dropdown menu below the sign-in button):

http://aws.amazon.com/console

3. launch Cloud BioLinux through the cloud console wizard


Launch instance wizard: steps 1 & 2

Click the button:

1. specify the Cloud BioLinux identifier under the “Community AMIs” tab

2. choose the computational capacity: memory, processor, CPU cores

Launch instance wizard: step 3

3. specify a login password for the Cloud BioLinux desktop in the “User Data” box

4. remaining steps: leave everything as default and keep clicking the “Continue” button until the wizard finishes and you are back at the console
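The wizard steps above can also be expressed as a scripted request. The following is a minimal sketch, not the workshop's own method: it only assembles the keyword arguments that the boto3 EC2 client method run_instances accepts (ImageId, InstanceType, UserData, MinCount, MaxCount). The AMI identifier and instance type are hypothetical placeholders, and passing the desktop password as the plain User Data string is an assumption based on wizard step 3.

```python
def build_launch_request(ami_id, instance_type, desktop_password, count=1):
    """Assemble keyword arguments for boto3's EC2 client run_instances call.

    ami_id           -- the Cloud BioLinux identifier from "Community AMIs" (step 1)
    instance_type    -- the computational capacity chosen in step 2
    desktop_password -- the text typed in the "User Data" box (step 3);
                        sending it as plain text is an assumption
    count            -- number of identical VMs to start
    """
    if not ami_id.startswith("ami-"):
        raise ValueError("AMI identifiers start with 'ami-'")
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "UserData": desktop_password,
        "MinCount": count,
        "MaxCount": count,
    }


if __name__ == "__main__":
    # "ami-00000000" and "m1.large" are placeholders, not real workshop values.
    request = build_launch_request("ami-00000000", "m1.large", "workshop")
    print(request)
    # With boto3 installed and AWS credentials configured, the request
    # could be submitted like this:
    #   import boto3
    #   boto3.client("ec2").run_instances(**request)
```

Keeping the request as a plain dictionary makes it possible to inspect every setting before anything is launched and billed.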

Launching Cloud BioLinux

back at the console after we completed the wizard

Pick a running instance, select it with your mouse and copy its “Public DNS” address (the Cloud BioLinux server address on the cloud)

While waiting for Cloud BioLinux to boot up...

● examples of NCBI public datasets on EC2

● bringing the data to the compute

Final step: connecting remotely to Cloud BioLinux

click the NX client icon on your computer's desktop

A. paste the Public DNS address in the “Host” box

B. select “Unix”, “Gnome” and the remote desktop size

C. login: “ubuntu” is the default user, “workshop” is the password we set
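If the NX client cannot connect, the instance may still be booting. The sketch below checks whether a host accepts TCP connections before you try again. It assumes the NX client reaches Cloud BioLinux over SSH on port 22, which is not stated on the slides; the function name and the host in the usage comment are hypothetical.

```python
import socket


def is_port_open(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Port 22 (SSH) is an assumption about how the NX client connects.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed DNS lookups.
        return False


# Usage, with the "Public DNS" address copied from the console
# (the address below is a made-up placeholder):
#   is_port_open("ec2-203-0-113-25.compute-1.amazonaws.com")
```

A True result only shows that the port is reachable; the desktop session itself may need a little longer to become ready.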

What if I want to share my alignments with a collaborator?

save your data as a new VM

$0.10 / GB / month

at 15 GB, it costs $1.50 / month
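The storage arithmetic above, written out as a small sketch. The rate is the figure quoted on the slide; the function name is hypothetical, and current prices may differ.

```python
from decimal import Decimal

# Storage price quoted on the slide, in dollars per GB per month.
RATE_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(size_gb, rate=RATE_PER_GB_MONTH):
    """Return the monthly cost in dollars of keeping size_gb of saved VM data."""
    return Decimal(str(size_gb)) * rate


print(monthly_storage_cost(15))  # prints 1.50
```

Decimal is used instead of float so that currency amounts are not affected by binary rounding.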

share your analysis results: publicly or only with your collaborators

authorized users can access the cloud VM/image with all the software, data, analysis results
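Sharing a saved image publicly or only with collaborators corresponds to two kinds of launch permission on EC2. The sketch below only builds the keyword arguments that the boto3 EC2 client method modify_image_attribute accepts (ImageId and LaunchPermission); the function name, image identifier and account number are hypothetical placeholders, and this is an illustration rather than the procedure used in the workshop.

```python
def build_share_request(image_id, account_ids=None, public=False):
    """Assemble keyword arguments for boto3's modify_image_attribute call.

    public=True        -- anyone may launch the image
    account_ids=[...]  -- only the listed AWS account numbers may launch it
    """
    if public:
        grants = [{"Group": "all"}]
    else:
        grants = [{"UserId": account_id} for account_id in (account_ids or [])]
    if not grants:
        raise ValueError("set public=True or give at least one account id")
    return {"ImageId": image_id, "LaunchPermission": {"Add": grants}}


if __name__ == "__main__":
    # Placeholder image id and collaborator account number.
    print(build_share_request("ami-00000000", account_ids=["111122223333"]))
    print(build_share_request("ami-00000000", public=True))
    # With boto3 installed and AWS credentials configured:
    #   import boto3
    #   boto3.client("ec2").modify_image_attribute(**request)
```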

Cloud BioLinux: whole system snapshot exchange

[Figure: researcher A and researcher B each start a VM / image, perform analysis, take a snapshot and share it]

Cloud BioLinux and Genomic Standards

whole system snapshot exchange

Acknowledgments & Credits

Brad Chapman - development of the fabric scripts and community organizer

Tim Booth, Bela Tiwari, Dawn Field – BioLinux 6.0 development and EC2 documentation

Deepak Singh and AWS - education grant supporting ISMB / BOSC workshop

Justin Johnson – community and sponsorship of cloudbiolinux.com

J. Craig Venter Inst. - time allowed to work on an open-source project

D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation

Members of the Cloud BioLinux community:

Enis Afgan, Michael Heuer, Richard Holland, Mark Jensen, Dave Messina, Steffen Möller, Roman Valls

Thank you!