Cloud BioLinux: pre-configured and on-demand
computing for genomics independently of institutional,
geographic or economic boundaries
Ntino Krampis, PhD
JCVI-NIAID workshop 2011, S. Africa
Expensive sequencing and large organizations
● large sequencing centers, multi-million, broad-impact sequencing projects
● dedicated bioinformatics department, coordination with other centers
Commodity sequencing and small labs
● small form-factor, bench-top sequencers available: GS Junior by 454
● sequencing as a standard technique in basic biology and genetics research
● RNA-seq and ChIP-seq, and each biologist will be tackling a metagenome
● downstream bioinformatics analysis for scientific discovery
● many commonly-used bioinformatics tools are difficult to install
● usually available only as source code - needs technical expertise
● large-scale sequence data analysis requires high performance and expensive computing hardware
Acquiring the sequence data is only the first step
● Cloud Computing: large-scale, high performance computers accessible through the Internet
● Example: when using Gmail, Google Docs, Yahoo! Mail, Facebook etc. you store and access data on a remote computer
● Cloud Computing services such as Amazon EC2 (http://aws.amazon.com/ec2) rent out large computational and data storage capacity on remote computers
Alternative: computational capacity on the cloud
operating system, bioinformatics software and data, are installed in a Virtual Machine (VM)
a VM is uploaded and executed on a cloud computing service
run a practically unlimited number of VMs for large-scale sequence data analysis
access VM on a desktop computer through the Internet
How does Cloud Computing work?
[Figure: local desktop computers connect through the Internet to VMs running on the remote Amazon EC2 Cloud Computing service]
● Cloud BioLinux leverages VM technology and the cloud, offering pre-configured bioinformatics computing
● allows setting up a high-performance data analysis environment, without any technical expertise
● researchers can perform large-scale data analysis, by simply using a desktop computer with Internet access
● accessible without any institutional, economic or national boundaries
Cloud BioLinux
1. sign up for an Amazon EC2 cloud account:
http://aws.amazon.com/ec2
You can also link an existing account from the main Amazon.com website to pay the cloud usage charges.
We have an account ready for you:
Username: [email protected]
Password: Nhg4|CL0ud!
2. using the account credentials sign in to the EC2 cloud console (select EC2 in the dropdown menu below the sign-in button):
http://aws.amazon.com/console
3. launch Cloud BioLinux through the cloud console wizard
Launching Cloud BioLinux
http://aws.amazon.com/console
Click the button :
1. specify the Cloud BioLinux identifier under the “Community AMIs” tab
2. choose the computational capacity: memory, processor, CPU cores
Launch instance wizard: steps 1 & 2
3. specify a login password for the Cloud BioLinux desktop in the “User Data” box
4. remaining steps: leave everything at the defaults and keep clicking the “Continue” button until the wizard finishes and you are back at the console
Launch instance wizard: step 3
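The wizard steps above can also be sketched in code. This is a minimal sketch, not the workshop's own method: it assumes the boto3 library and its `run_instances` call, and the AMI identifier `ami-xxxxxxxx` and instance type `m1.large` are placeholders, because the slides do not give the actual values. Passing the desktop password as plain User Data mirrors wizard step 3 and is an assumption about how the image reads it.

```python
def build_launch_request(ami_id, instance_type, desktop_password):
    """Collect the choices made in wizard steps 1-3 as run_instances arguments."""
    return {
        "ImageId": ami_id,               # step 1: identifier from "Community AMIs"
        "InstanceType": instance_type,   # step 2: memory / processor / CPU cores
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": desktop_password,    # step 3: text typed in the "User Data" box
    }


def launch(request, region="us-east-1"):
    """Send the request to EC2 (needs boto3 and valid AWS credentials)."""
    import boto3  # imported here so the request can be built without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # Placeholder values: replace them with the real Cloud BioLinux identifier.
    request = build_launch_request("ami-xxxxxxxx", "m1.large", "workshop")
    print(request)
    # launch(request)  # uncomment to actually start (and pay for) an instance
```

Building the request separately from sending it lets you review every choice before anything is started on the cloud.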
Launching Cloud BioLinux
back at the console after completing the wizard
Pick a running instance, select it with your mouse, and copy its “Public DNS” address (the Cloud BioLinux server address on the cloud)
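Copying the “Public DNS” address from the console can be approximated in code as well. The sketch below assumes the response layout returned by boto3's `describe_instances` call (`Reservations`, `Instances`, `State`, `PublicDnsName`); the sample data in the usage part is made up for illustration.

```python
def running_public_dns(describe_response):
    """Return the Public DNS names of all instances in the 'running' state."""
    names = []
    for reservation in describe_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            state = instance.get("State", {}).get("Name")
            dns_name = instance.get("PublicDnsName")
            # Instances that are still booting have no usable address yet.
            if state == "running" and dns_name:
                names.append(dns_name)
    return names


if __name__ == "__main__":
    # Made-up sample in the shape of a describe_instances response.
    sample = {
        "Reservations": [
            {
                "Instances": [
                    {
                        "State": {"Name": "running"},
                        "PublicDnsName": "ec2-203-0-113-10.compute-1.amazonaws.com",
                    },
                    {"State": {"Name": "pending"}, "PublicDnsName": ""},
                ]
            }
        ]
    }
    print(running_public_dns(sample))
```

With real credentials, the input would come from `boto3.client("ec2").describe_instances()`.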
While waiting for Cloud BioLinux to boot up...
● examples of NCBI public datasets on EC2
● bringing the data to the compute
Final step: connecting remotely to Cloud BioLinux
click the NX client icon on your computer's desktop
A. paste the DNS in the “Host” box
B. select “Unix”, “Gnome”, and the remote desktop size
C. “ubuntu” is the default user login; “workshop” is the password we set
What if I want to share my alignments with a collaborator?
save your data as a new VM
$0.10 / GB / month
at 15 GB, it costs $1.50 / month
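The storage cost arithmetic can be checked with a few lines of code. This sketch uses only the rate quoted on the slide ($0.10 per GB per month); the function name is made up for illustration, and current prices may differ.

```python
from decimal import Decimal

# Rate quoted on the slide; treat it as an example, not a current price.
RATE_PER_GB_MONTH = Decimal("0.10")


def monthly_storage_cost(size_gb, rate=RATE_PER_GB_MONTH):
    """Monthly cost in dollars of keeping a saved VM of the given size."""
    return Decimal(size_gb) * rate


if __name__ == "__main__":
    print(monthly_storage_cost(15))  # 1.50
```

`Decimal` is used instead of floating point so that currency amounts stay exact.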
share your analysis results: publicly or only with your collaborators
authorized users can access the cloud VM/image with all the software, data, analysis results
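Sharing a saved image publicly or only with selected collaborators can be sketched as follows. This is an illustration under assumptions, not the procedure shown in the workshop: it assumes boto3's `modify_image_attribute` call with a `LaunchPermission` argument, and the image identifier and the 12-digit account number are placeholders.

```python
def build_share_request(image_id, account_ids=None, public=False):
    """Build modify_image_attribute arguments that grant launch permission."""
    if public:
        # Anyone with an AWS account may start the image.
        grants = [{"Group": "all"}]
    else:
        # Only the listed collaborator accounts may start the image.
        grants = [{"UserId": account_id} for account_id in (account_ids or [])]
    if not grants:
        raise ValueError("give at least one account id, or set public=True")
    return {"ImageId": image_id, "LaunchPermission": {"Add": grants}}


def share(request, region="us-east-1"):
    """Apply the permission change (needs boto3 and valid AWS credentials)."""
    import boto3  # imported here so the request can be built without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    ec2.modify_image_attribute(**request)


if __name__ == "__main__":
    # Placeholder image id and collaborator account number.
    print(build_share_request("ami-xxxxxxxx", account_ids=["123456789012"]))
    print(build_share_request("ami-xxxxxxxx", public=True))
```

Keeping the collaborator list explicit makes the "only with your collaborators" case the default and the public case a deliberate choice.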
Cloud BioLinux
whole system snapshot exchange
[Figure: researcher A and researcher B each start a VM / image, perform analysis, take a snapshot, and share it with the other]
Cloud BioLinux and Genomic Standards
whole system snapshot exchange
Acknowledgments & Credits
Brad Chapman - development of the fabric scripts and community organizer
Tim Booth, Bela Tiwari, Dawn Field – BioLinux 6.0 development and EC2 documentation
Deepak Singh and AWS - education grant supporting ISMB / BOSC workshop
Justin Johnson – community and sponsorship of cloudbiolinux.com
J. Craig Venter Inst. - time allowed to work on an open-source project
D. Gomez, E. Navarro, J. Shao, I. Singh – JCVI technology innovation
Members of the Cloud BioLinux community:
Enis Afgan, Michael Heuer, Richard Holland, Mark Jensen, Dave Messina, Steffen Möller, Roman Valls
Thank you!