Talk given at "Cloud Computing for Systems Biology" workshop
The role of cloud computing in big biology
Deepak Singh
Via Reavel under a CC-BY-NC-ND license
life science industry
Credit: Bosco Ho
By ~Prescott under a CC-BY-NC license
context
analysis methods
technology
?
??
?
back of the room
technology
Image: Keith Allison under a CC-BY-SA license
inherent characteristics
data driven
multi-dimensional
collaborative
distributed
<amazon web services>
the cloud
has_many :definitions
infrastructure as a service
precursors
virtualization
service oriented architecture
distributed computing
Your Custom Applications and Services
Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Database: Amazon RDS and SimpleDB
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
scalable
cost effective
pay as you go
reliable
secure
Amazon EC2
servers on demand
highly scalable
3000 CPUs for one firm’s risk management application
[Chart: Number of EC2 Instances, by day from Wednesday 4/22/2009 to Tuesday 4/28/2009. Labels: 3000; 300 CPUs on weekends]
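A rough sketch of what scaling down on weekends means under pay-as-you-go billing. Only the 3000 and 300 figures come from the slide; treating them as constant over 24-hour days, with five weekdays and two weekend days, is an assumption made for illustration, and no AWS price is implied.

```python
def weekly_instance_hours(weekday_instances, weekend_instances):
    """Instance-hours for 5 weekdays plus 2 weekend days, 24 hours each."""
    return (5 * weekday_instances + 2 * weekend_instances) * 24

# Elastic: run 3000 on weekdays and drop to 300 on weekends.
elastic_hours = weekly_instance_hours(3000, 300)
# Fixed: stay provisioned for the weekday peak all week.
fixed_hours = weekly_instance_hours(3000, 3000)

savings_fraction = 1 - elastic_hours / fixed_hours
print(f"elastic: {elastic_hours} instance-hours, fixed: {fixed_hours}")
print(f"saved by scaling down: {savings_fraction:.1%}")
```

With a flat per-hour price, the saving in cost is the same fraction as the saving in instance-hours.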
design for failure
“Everything fails, all the time” -- Werner Vogels
assume failure
design backwards
nothing fails
highly available systems
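One common way to "assume failure" in client code is to retry a failing call with exponential backoff. The sketch below is illustrative only: `call_with_retries` and `flaky_fetch` are hypothetical names, not part of any AWS library, and the simulated failure stands in for a real network error.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff and jitter on any exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Base delay doubles each attempt; jitter stretches it by up to 2x.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * (1 + random.random()))

# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "ok"

# The sleep function is swapped for a no-op so the example runs instantly.
result = call_with_retries(flaky_fetch, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```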
elastic block store
elastic IP
SQS
US East Region
Availability Zone A
Availability Zone B
Availability Zone C
Availability Zone D
data storage
one size does not fit all
Amazon S3
distributed object store
durable
available
scalable
fast
simple
structured data anyone?
Amazon SimpleDB
zero administration
highly available
schema less
key-value store
Amazon Relational Database Service
single API call
MySQL database
automatic backup
scale up with API call
futures: master-slave replication
futures: data center failover
what do people do?
solve problems
> 1PB of data in S3
provide platforms & services
http://heroku.com
Platform as a Service
http://cyclecomputing.com
Computation as a Service
http://wiki.github.com/documentcloud/cloud-crowd
Computational Platforms
sudo gem install cloud-crowd
Image: Matt Wood
they do science
3.7 million classifications in just over three days
~15 million in less than a month
>2.6 million clicks in 100 hours
Image via image editor under a CC-BY license
Protein Docking @ Pfizer
http://bioteam.net
http://aws.amazon.com/publicdatasets/
</amazon web services>
anecdote
collaborative project
800 GB
Image: Wikipedia Commons
weeks to get started
Image: Matt Wood
Image: Chris Dagdigian
gigabytes
terabytes
petabytes
really fast
constant flux
Image: Chris Dagdigian
data management is not data storage
masterclass
Big data & Biology: The implications of petascale science
Tuesday November 17, 1:30PM - 3:00PM
Room: PB253-254-257-258
“science data platform”
deliver data to applications
deliver data to people
typical informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
Via Argonne National Labs under a CC-BY-SA license
killer app
Data
Apps
Data Platform
App Platform
data services
Data Platform
application services
App Platform
Scalable Data Platform
Services
APIs
Getters Filters Savers
WORK
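One possible reading of "Getters Filters Savers" is a pipeline of small, swappable stages behind an API. The sketch below assumes that reading; `run_pipeline`, the stage functions, and the sample records are all hypothetical and are not taken from the talk.

```python
def run_pipeline(getter, filters, saver):
    """Pull records from getter, pass each through every filter, hand survivors to saver."""
    for record in getter():
        for keep in filters:
            if not keep(record):
                break  # rejected by this filter: skip the record
        else:
            saver(record)  # no filter rejected it

# Hypothetical getter yielding made-up sequence records.
def get_sequences():
    yield {"id": "seq1", "bases": "ACGTACGT"}
    yield {"id": "seq2", "bases": "AC"}
    yield {"id": "seq3", "bases": "NNNNNNNN"}

def long_enough(record):
    return len(record["bases"]) >= 4

def no_unknown_bases(record):
    return "N" not in record["bases"]

saved = []  # stand-in for a real store
run_pipeline(get_sequences, [long_enough, no_unknown_bases], saved.append)
print([record["id"] for record in saved])
```

Because each stage is just a callable, a getter that reads from one store can be swapped for another without touching the filters or the saver, which is one way to keep components loosely coupled.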
must accommodate change
must scale
highly available
loosely coupled
dynamic
task-based resources
one project
one set of resources
no waiting
Protein Docking @ Pfizer
http://bioteam.net
distributed mindset
one approach
disk read/writes: slow & expensive
data processing: fast & cheap
distribute data, parallelize reads
map/reduce
distributed data processing at scale
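A minimal sketch of the map/reduce idea on a single machine: split the data into chunks, process the chunks in parallel (map), and merge the partial results (reduce). Local threads stand in for cluster nodes here; this is an illustration with hypothetical function names and does not use Hadoop.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_chunk(chunk):
    """Map step: count bases within one chunk of the data."""
    return Counter(chunk)

def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right

def count_bases(sequence, chunk_size=4, workers=2):
    # Distribute the data: split the sequence into fixed-size chunks.
    chunks = [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]
    # Parallelize the reads: each worker handles chunks independently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())

totals = count_bases("ACGTTTGACA")
print(dict(totals))
```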
abstracting away hadoop
apache hive
http://hadoop.apache.org/hive/
apache pig
http://hadoop.apache.org/pig/
hosted hadoop service
hadoop easy & simple
[Diagram: Amazon Elastic MapReduce. Input data is placed in an input S3 bucket; the application is deployed through the web console or command line tools; Hadoop runs on Amazon EC2 instances; output results are written to an output S3 bucket; notify at end; get results.]
developers: develop & distribute
scientists/analysts: consume
CloudBurst
Catalog k-mers -> Collect seeds -> End-to-end alignment
Mike Schatz, University of Maryland
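The three steps named on the slide can be illustrated with a toy, single-machine seed-and-extend sketch. This is not CloudBurst's implementation: the function names, the sequences, and the mismatch-only scoring (no insertions or deletions) are simplifying assumptions.

```python
from collections import defaultdict

def catalog_kmers(sequence, k):
    """Map every k-mer in the sequence to the offsets where it starts."""
    index = defaultdict(list)
    for offset in range(len(sequence) - k + 1):
        index[sequence[offset:offset + k]].append(offset)
    return index

def collect_seeds(reference_index, read, k):
    """Return candidate reference start positions implied by shared k-mers."""
    candidates = set()
    for read_offset in range(len(read) - k + 1):
        kmer = read[read_offset:read_offset + k]
        for ref_offset in reference_index.get(kmer, []):
            candidates.add(ref_offset - read_offset)
    return candidates

def align_end_to_end(reference, read, start, max_mismatches):
    """Compare the whole read at `start`; return the mismatch count, or None if rejected."""
    if start < 0 or start + len(read) > len(reference):
        return None
    window = reference[start:start + len(read)]
    mismatches = sum(1 for a, b in zip(window, read) if a != b)
    return mismatches if mismatches <= max_mismatches else None

# Made-up example data: the read differs from the reference by one base.
reference = "GATTACAGATTTCA"
read = "ACAGTTT"
k = 3

index = catalog_kmers(reference, k)
hits = {}
for start in collect_seeds(index, read, k):
    score = align_end_to_end(reference, read, start, max_mismatches=1)
    if score is not None:
        hits[start] = score
print(hits)
```

The seeds narrow the search to a few candidate positions, so the more expensive full-length comparison runs only where a shared k-mer suggests a possible match.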
Scalable Data Platform
Services
APIs
Getters Filters Savers
WORK
IN CONCLUSION
large scale biology
complex multidimensional data
whole lot of data
distributed collaborations
new computing and data architectures
a solution: cloud services
distributed
scalable
economical
here today
[email protected]
Twitter: @mndoci
Presentation ideas from @mza, James Hamilton, and @lessig
Thank you!