Evaluating Cloud Computing for HPC Applications
Lavanya Ramakrishnan CRD & NERSC
Magellan Research Agenda

• What are the unique needs and features of a science cloud?
– NERSC Magellan User Survey
• What applications can efficiently run on a cloud?
– Benchmarking cloud technologies (Hadoop, Eucalyptus) and platforms (Amazon EC2, Azure)
• Are cloud computing programming models such as Hadoop effective for scientific applications?
– Experimentation with early applications
• Can scientific applications use a data-as-a-service or software-as-a-service model?
• What are the security implications of user-controlled cloud images?
• Is it practical to deploy a single logical cloud across multiple DOE sites?
• What is the cost and energy efficiency of clouds?
Magellan User Survey

Program Office breakdown of respondents:
– Advanced Scientific Computing Research: 17%
– Biological and Environmental Research: 9%
– Basic Energy Sciences (Chemical Sciences): 10%
– Fusion Energy Sciences: 10%
– High Energy Physics: 20%
– Nuclear Physics: 13%
– Advanced Networking Initiative (ANI) Project: 3%
– Other: 14%

[Bar chart: fraction of surveyed users (0%–90%) interested in each cloud feature]
– Access to additional resources
– Access to on-demand (commercial) paid resources closer to deadlines
– Ability to control software environments specific to my application
– Ability to share setup of software or experiments with collaborators
– Ability to control groups/users
– Exclusive access to the computing resources / ability to schedule independently of other groups/users
– Easier to acquire/operate than a local cluster
– Cost associativity (i.e., I can get 10 CPUs for 1 hr now or 2 CPUs for 5 hrs at the same cost)
– MapReduce programming model / Hadoop
– Hadoop File System
– User interfaces/science gateways: use of clouds to host science gateways and/or access to cloud resources through science gateways
Cloud Computing Services

• Infrastructure as a Service (IaaS)
– Provides access to (nominally unlimited) data storage and compute cycles
– e.g., Amazon EC2, Eucalyptus
• Platform as a Service (PaaS)
– Delivery of a computing platform/software stack
– Containers/images for specific user groups
– e.g., Hadoop, Azure
• Software as a Service (SaaS)
– Specific function provided for use across multiple user groups (e.g., science gateways)
Magellan Software

[Architecture diagram: the Magellan cluster hosts batch queues, virtual machines (Eucalyptus), Hadoop, private clusters, and science and storage gateways, with links to public or remote clouds and to ANI]
Amazon Web Services

• Web-service API to an IaaS offering
• Uses Xen paravirtualization
– cluster compute instance type uses hardware-assisted virtualization
• Non-persistent local disk in the VM
• Simple Storage Service (S3)
– scalable, persistent object store
• Elastic Block Store (EBS)
– persistent, block-level storage
Eucalyptus

• Open-source IaaS implementation
– API compatible with Amazon AWS
– manages virtual machines
• Walrus & Block Storage
– interface compatible with S3 & EBS
• Available to users on the Magellan testbed
• Private virtual clusters
– scripts to manage dynamic virtual clusters
– NFS/Torque, etc.
Virtualization Impact

• Platforms
– Amazon, Azure, Lawrencium (IT cluster)
– Magellan: IB, TCP over IB, TCP over Ethernet, VM
• Workloads
– HPCC
– NERSC6 benchmarks
– Application pipelines: JGI, Supernova Factory
• Metrics
– Performance, cost, reliability, programmability
NERSC-6 Benchmark Performance [1/2]

[Bar chart: runtime relative to Carver (0–18) for GAMESS, GTC, IMPACT, fvCAM, and MAESTRO256 on Amazon EC2, Lawrencium, EC2-Beta, EC2-Beta-Opt, Franklin, and Carver]
NERSC-6 Benchmark Performance [2/2]

[Bar chart: runtime relative to Carver (0–60) for MILC and PARATEC on Amazon EC2, Lawrencium, EC2-Beta, EC2-Beta-Opt, Franklin, and Carver]
Magellan: NERSC6 Application Benchmarks

[Bar chart: percentage performance relative to native (0–100) for GTC, PARATEC, and CAM under TCP over IB, TCP over Ethernet, and VM]
Magellan: HPCC

[Bar charts: percentage performance relative to native. Left: Ping Pong Latency and RandRing Latency (0–12) under TCP over IB, TCP over Ethernet, and IB/VM. Right: DGEMM, Ping Pong Bandwidth, HPL, and PTRANS (0–100) under TCP over IB, TCP over Ethernet, and VM]
Performance-Cost Tradeoffs

• James Hamilton's cost model
– http://perspectives.mvdirona.com/2010/09/18/OverallDataCenterCosts.aspx
– expanded for HPC environments
• Quantifies the difference in cost between IB and Ethernet

| Application class | Performance increase / cost increase |
| --- | --- |
| Tightly coupled with I/O (e.g., CAM) | 4.36 |
| Tightly coupled with minimal I/O (e.g., PARATEC, GTC) | 1.2 |
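The ratio above can be read as a break-even rule: the faster (more expensive) interconnect pays for itself when the performance increase exceeds the cost increase. A minimal sketch of that reading — all numbers below are illustrative, not outputs of the expanded cost model:

```python
"""Sketch of the performance-vs-cost comparison behind the table above.
A ratio > 1 means the performance gain outweighs the cost premium."""

def perf_cost_ratio(speedup, cost_multiplier):
    """(performance increase) / (cost increase) for a system upgrade."""
    return speedup / cost_multiplier

# Hypothetical example: IB makes a tightly coupled, IO-heavy code 2.2x
# faster while raising system cost 1.15x -> worth the premium.
assert perf_cost_ratio(2.2, 1.15) > 1
# A code that barely benefits (1.05x) does not justify a 15% cost increase.
assert perf_cost_ratio(1.05, 1.15) < 1
```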
Nearby Supernova Factory

• Tools to measure the expansion history of the Universe and explore the nature of Dark Energy
– Largest data-volume supernova search
• Data pipeline
– Custom data analysis codes, coordinated by Python scripts
– Run on a standard Linux batch-queue cluster
• Cloud provides
– Control over OS versions
– Root access and a shared "group" account
– Immunity to externally enforced OS or architecture changes
Experiments on Amazon EC2

| Input Data | Output Data |
| --- | --- |
| EBS via NFS | Local storage to EBS via NFS |
| Staged to local storage from EBS | Local storage to EBS via NFS |
| EBS via NFS | EBS via NFS |
| Staged to local storage from EBS | EBS via NFS |
| EBS via NFS | Local storage to S3 |
| Staged to local storage from S3 | Local storage to S3 |
Total Cost (Instance + Storage/month)

[Bar chart: total cost of an experiment (instance cost + storage cost/month, 0–180) across the experiment matrix EBS-A1, EBS-A2, EBS-B1, EBS-B2, S3-A, and S3-B; series: Data cost per month (80), Instance cost (80), Data cost per month (40), Instance cost (40)]
Hadoop Stack

• Open-source, reliable, scalable distributed computing
– implementation of MapReduce
– Hadoop Distributed File System (HDFS)

[Stack diagram: Core, Avro, MapReduce, HDFS, ZooKeeper, Pig, Chukwa, Hive, HBase. Source: Hadoop: The Definitive Guide]
Hadoop for Science

• Advantages of Hadoop
– transparent data replication, data-locality-aware scheduling
– fault-tolerance capabilities
• Mode of operation
– use streaming to launch a script that calls the executable
– HDFS for input; a shared file system is needed for the binary and database
– input format: handle multi-line inputs (BLAST sequences) and binary data (High Energy Physics)
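The streaming mode of operation above amounts to a small wrapper script that Hadoop feeds on stdin. A minimal sketch: the FASTA record grouping (one answer to the multi-line-input problem) and the `run_tool` stand-in are illustrative assumptions, not the actual Magellan code.

```python
#!/usr/bin/env python
"""Sketch of a Hadoop streaming mapper that wraps an external executable.
Hadoop streaming delivers input lines on stdin and expects key\tvalue
pairs on stdout; multi-line records (e.g., FASTA sequences for BLAST)
must be regrouped by the script itself."""
import sys

def fasta_records(lines):
    """Group multi-line FASTA input into (header, sequence) records."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def run_tool(seq):
    """Stand-in for invoking the real executable (e.g., via subprocess)."""
    return len(seq)

if __name__ == "__main__":
    for header, seq in fasta_records(sys.stdin):
        print("%s\t%s" % (header, run_tool(seq)))
```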
Hadoop Benchmarking: Early Results [1/2]

• Compare traditional parallel file systems to HDFS
– TeraGen and TeraSort to compare file-system performance
– 32 maps for TeraGen and 64 reduces for TeraSort over a terabyte of data

[Bar chart: time in seconds (0–6000) for TeraGen and TeraSort (64 reduces) on GPFS, HDFS, and Lustre]
Hadoop Benchmarking: Early Results [2/2]

• TestDFSIO to understand concurrency at the default block size

[Line chart: throughput (MB/s, 0–70) vs. number of concurrent writers (0–60) for 10 MB and 10 GB file sizes]
IMG Systems: Genome & Metagenome Data Flow

[Data-flow diagram. Figures recoverable from the slide: 5,115 genomes / 6.5 million genes; +~350–500 genomes (~0.5–1 million genes) every 4 months; +330 genomes (158 GEBA), 8.2 million genes, on demand; 65 samples from 21 studies (IMG + 2.6 million genes, 9.1 million total) monthly; +287 samples from ~105 studies (+12.5 million genes, 19 million genes) on demand]
BLAST on Hadoop

• NCBI BLAST (2.2.22)
– reference: IMG genomes, 6.5 million genes (~3 GB in size)
– full input set: 12.5 million metagenome genes run against the reference
• BLAST Hadoop
– uses streaming to manage input data sequences
– binary and databases on a shared file system
• BLAST task-farming implementation
– server reads inputs and manages the tasks
– client runs BLAST, copies the database to local disk or ramdisk once on startup, pushes back results
– advantages: fault-resilient, and allows incremental expansion as resources become available
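The task-farming design above can be sketched roughly as follows. The in-process queue, the `run_task` hook, and the retry limit are illustrative assumptions; the real implementation distributes work from the server to clients over the network.

```python
"""Sketch of the task-farming pattern: the server keeps a queue of inputs,
clients pull tasks and push back results, and failed tasks are re-queued,
which is what makes the farm fault-resilient.  A new client can start
pulling from the queue at any time (incremental expansion)."""
import queue

def task_farm(tasks, run_task, max_retries=3):
    """Run each task through run_task, re-queuing failures."""
    pending = queue.Queue()
    for t in tasks:
        pending.put((t, 0))          # (task, attempts so far)
    results = {}
    while not pending.empty():
        task, tries = pending.get()
        try:
            results[task] = run_task(task)   # client side: run BLAST here
        except Exception:
            if tries + 1 < max_retries:
                pending.put((task, tries + 1))  # hand to another client
            # else: give up; a real farm would log the permanent failure
    return results
```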
BLAST Performance

[Chart: time in minutes (0–100) vs. number of processors (16, 32, 64, 128) for EC2-taskFarmer, Franklin-taskFarmer, EC2-Hadoop, and Azure]
BLAST on Yahoo! M45 Hadoop

• Initial config
– Hadoop memory ulimit issues; memory limits increased to accommodate high-memory tasks
– 1 map per node for high-memory tasks, to reduce contention
– thrashing when the DB does not fit in memory
• NFS shared file system for the common DB
– moved the DB to local nodes (copy to local /tmp)
– initial copy takes 2 hours, but then a BLAST job completes in < 10 minutes
– performance is equivalent to other cloud environments
– future: experiment with DistributedCache
• Time to solution varies: no guarantee of simultaneous availability of resources

Strong user-group and sysadmin support was key in working through this.
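The copy-once optimization described above (stage the shared DB from NFS to local /tmp the first time a task lands on a node, then reuse it) can be sketched like this; the paths and the marker-file scheme are hypothetical, not the actual M45 setup.

```python
"""Sketch of node-local DB caching: the first task on a node pays the
slow NFS copy (hours for the real ~3 GB BLAST DB), later tasks on the
same node hit the local copy.  A real version needs locking, since
concurrent maps on one node could race on the marker file."""
import os
import shutil

def local_db(shared_db, cache_dir="/tmp/blast-db"):
    """Return a node-local path to the DB, copying from NFS only once."""
    marker = os.path.join(cache_dir, ".complete")
    local = os.path.join(cache_dir, os.path.basename(shared_db))
    if not os.path.exists(marker):       # first task on this node
        os.makedirs(cache_dir, exist_ok=True)
        shutil.copy(shared_db, local)    # the one-time expensive copy
        open(marker, "w").close()        # mark the cache as ready
    return local
```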
Summary: Virtualization for Science

• Porting applications still requires a lot of work
• Public clouds
– Virtualization has a performance impact
– Failures when creating large instances
– Data costs tend to be overwhelming
• Eucalyptus
– Learning curve and stability at scale
– Alternate stacks and trying version 2.0
– SSE instructions were not exposed in the VM
• Additional benchmarking needed
Summary: Hadoop for Science

• Deployment challenges
– all jobs run as user "hadoop", affecting file permissions
– less control over how many nodes are used, which affects allocation policies
– file-system performance for large file sizes
• Programming challenges: no turn-key solution
– using existing code bases, managing input formats and data
• Performance
– BLAST over Hadoop: performance is comparable to existing systems
– existing parallel file systems can be used through Hadoop On Demand
• Additional benchmarking and tuning needed; plug-ins for science
Acknowledgements

This work was funded in part by the Advanced Scientific Computing Research (ASCR) program in the DOE Office of Science under contract number DE-AC02-05CH11231.

Thanks to CITRIS/UC, Yahoo! M45, Amazon EC2 Education and Research Grants, Microsoft Research, Wei Lu, Dennis Gannon, Masoud Nikravesh, and Greg Bell.

Magellan – Shane Canon, Iwona Sakrejda
Magellan Benchmarking – Shane Canon, Nick Wright
EC2 Benchmarking – Keith Jackson, Krishna Muriki, John Shalf, Shane Canon, Harvey Wasserman, Shreyas Cholia, Nick Wright
BLAST – Shreyas Cholia, Keith Jackson, Shane Canon, John Shalf
Supernova Factory – Keith Jackson, Rollin Thomas, Karl Runge