High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing...
![Page 1: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/1.jpg)
High Performance Computing on AWS: cfnCluster and Weather Research & Forecasting (WRF) as Examples
Kevin Jorissen - [email protected]
Scientific Computing – Amazon Web Services
![Page 2: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/2.jpg)
AWS Cloud is a powerful platform for most science workloads, from modest applications to large HPC and big data.
• What’s good for other industries is good for you, too
• Huge ecosystem of services beyond compute and storage
• Spend more time doing what only you can do: researchers doing science; IT solving your users’ challenges; teachers creating learning opportunities
AWS = more science for you
![Page 3: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/3.jpg)
“The AWS model works when we have the greatest variety of uncoupled workloads all using the cloud. When it works, it drives the cost of computation down to trivial levels so people can concentrate more on their data, their science and their ideas, rather than bothering to worry about infrastructure.
Science is one of the greatest areas of computation and also happens to be the one that can most benefit from that democratization in cost and global accessibility and where we think Amazon can make a huge, really disruptive, impact on the world by participating - which is, at the most basic level, what we are about as a company.”
AWS = more science for you
![Page 4: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/4.jpg)
AWS Cloud is a powerful platform for most science workloads, from modest applications to large HPC and big data.
• cfnCluster: create an HPC cluster in 10 minutes, with the flexibility of Cloud
• WRF: example of an HPC science workload on AWS
• WRF: example of a classroom experience supported by AWS
Topics
![Page 5: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/5.jpg)
1. cfnCluster - provision an HPC cluster in minutes
http://aws.amazon.com/hpc/cfncluster/
10 minutes
• Created in minutes
• A parallel cluster with master, compute nodes, NFS shared disk, and job scheduler (sge, openlava, or torque)
• Choose compute nodes based on your needs
• Cluster sizes up and down depending on your work queue
• Stretches your research dollars further
• There’s no queue in the cloud
• Save any infrastructure configuration as a template
![Page 6: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/6.jpg)
Config options to explore …
#cfncluster
Many options, but the most interesting ones immediately are:
# (defaults to t2.micro for default template)
compute_instance_type = t2.micro
# Master Server EC2 instance type
# (defaults to t2.micro for default template)
#master_instance_type = t2.micro
# Initial number of EC2 instances to launch as compute nodes in the cluster.
# (defaults to 2 for default template)
#initial_queue_size = 1
# Maximum number of EC2 instances that can be launched in the cluster.
# (defaults to 10 for the default template)
#max_queue_size = 10
# Boolean flag to set autoscaling group to maintain initial size and scale back
# (defaults to false for the default template)
#maintain_initial_size = true
# Cluster scheduler
# (defaults to sge for the default template)
scheduler = sge
# Type of cluster to launch i.e. ondemand or spot
# (defaults to ondemand for the default template)
#cluster_type = ondemand
# Spot price for the ComputeFleet
#spot_price = 0.00
# Cluster placement group. This placement group must already exist.
# (defaults to NONE for the default template)
#placement_group = NONE
• Instance types: t2.micro is tiny; c3.4xlarge might be more interesting …
• initial_queue_size / max_queue_size: min & max size of your cluster.
• maintain_initial_size: whether to fall back when things get quiet.
• scheduler: can also use ‘openlava’ or ‘torque’.
• cluster_type / spot_price: explore the SPOT market if you want to save money :-)
• placement_group: a placement group will provision your instances very close to each other on the network.
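To make the options above concrete, here is a minimal Python sketch that parses a cfncluster-style config fragment and sanity-checks the queue sizes. The section name `[cluster default]` and the chosen values (c3.4xlarge, spot) are assumptions for illustration; the fallbacks mirror the defaults quoted in the comments above.

```python
import configparser

# Hypothetical config fragment; section name and values are assumptions.
SAMPLE = """
[cluster default]
compute_instance_type = c3.4xlarge
master_instance_type = t2.micro
initial_queue_size = 2
max_queue_size = 10
maintain_initial_size = true
scheduler = sge
cluster_type = spot
"""


def load_cluster_options(text, section="cluster default"):
    """Parse the fragment and return the options as typed Python values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    opts = parser[section]
    settings = {
        "compute_instance_type": opts.get("compute_instance_type", fallback="t2.micro"),
        "initial_queue_size": opts.getint("initial_queue_size", fallback=2),
        "max_queue_size": opts.getint("max_queue_size", fallback=10),
        "maintain_initial_size": opts.getboolean("maintain_initial_size", fallback=False),
        "scheduler": opts.get("scheduler", fallback="sge"),
        "cluster_type": opts.get("cluster_type", fallback="ondemand"),
    }
    # The cluster cannot start larger than its allowed maximum.
    if settings["initial_queue_size"] > settings["max_queue_size"]:
        raise ValueError("initial_queue_size must not exceed max_queue_size")
    return settings


settings = load_cluster_options(SAMPLE)
print(settings)
```

A check like this is only a convenience before launching; the cfncluster tool itself remains the authority on which keys and values are valid.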
![Page 7: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/7.jpg)
System-wide Upgrade from Ivy Bridge to Haswell
#cfncluster
Yes, really :-)
$ ed ~/.cfncluster/config
/compute_instance_type/
compute_instance_type = c3.8xLarge
s/c3/c4/p
compute_instance_type = c4.8xLarge
w
949
$ cfncluster update boof-cluster
Downgrading is just as easy. Honest.
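For readers less at home in ed, the same one-line substitution can be sketched in Python. This is an illustrative equivalent operating on an in-memory string, not part of cfncluster; the function name and the sample text are hypothetical.

```python
import re


def switch_instance_family(config_text, old="c3", new="c4"):
    """Swap the instance family on the compute_instance_type line only."""
    pattern = re.compile(
        r"^(compute_instance_type\s*=\s*)" + re.escape(old) + r"(?=\.)",
        re.MULTILINE,
    )
    # Keep the key and the '=' as they were; replace only the family prefix.
    return pattern.sub(lambda m: m.group(1) + new, config_text)


# Hypothetical two-line excerpt of a config file.
before = "master_instance_type = c3.large\ncompute_instance_type = c3.8xlarge\n"
after = switch_instance_family(before)
print(after)
```

Only the compute line changes; the master line is left alone. Running `cfncluster update` afterwards, as on the slide, is what applies the change to the cluster.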
![Page 8: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/8.jpg)
Infrastructure as code
#cfncluster
The creation process might take a few minutes (maybe up to 5 mins or so, depending on how you configured it).
Because the API to CloudFormation (the service that does all the orchestration) is asynchronous, we could kill the terminal session if we wanted to and watch the whole show from the AWS console (where you’ll find it all under the “CloudFormation” dashboard, in the events tab for this stack).
$ cfncluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"
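If you want to script against these values (for example, to ssh to the master automatically), the `Output:` lines can be collected into a dictionary. The sketch below assumes the exact `Output:"Key"="Value"` layout shown here; the parsing helper is hypothetical and not part of cfncluster.

```python
import re

# Matches lines of the form Output:"Key"="Value"
OUTPUT_LINE = re.compile(r'^Output:"(?P<key>[^"]+)"="(?P<value>[^"]*)"$')


def parse_outputs(text):
    """Collect Output:"key"="value" lines into a dict, ignoring other lines."""
    outputs = {}
    for line in text.splitlines():
        match = OUTPUT_LINE.match(line.strip())
        if match:
            outputs[match.group("key")] = match.group("value")
    return outputs


sample = '''Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"'''

outputs = parse_outputs(sample)
print(outputs["MasterPublicIP"])
```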
![Page 9: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/9.jpg)
![Page 10: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/10.jpg)
Yes, it’s a real HPC cluster
#cfncluster
Now you have a cluster, probably running CentOS 6.x, with Sun Grid Engine as a default scheduler, and openMPI and a bunch of other stuff installed. You also have a shared filesystem in /shared and an autoscaling group ready to expand the number of compute nodes in the cluster when the existing ones get busy.
You can customize quite a lot via the .cfncluster/config file - check out the comments.
arthur ~ [26] $ cfncluster create boof-cluster
Starting: boof-cluster
Status: cfncluster-boof-cluster - CREATE_COMPLETE
Output:"MasterPrivateIP"="10.0.0.17"
Output:"MasterPublicIP"="54.66.174.113"
Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"
Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"
arthur ~ [27] $ ssh [email protected]
The authenticity of host '54.66.174.113 (54.66.174.113)' can't be established.
RSA key fingerprint is 45:3e:17:76:1d:01:13:d8:d4:40:1a:74:91:77:73:31.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '54.66.174.113' (RSA) to the list of known hosts.
[ec2-user@ip-10-0-0-17 ~]$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/xvda1 10185764 7022736 2639040 73% /
tmpfs 509312 0 509312 0% /dev/shm
/dev/xvdf 20961280 32928 20928352 1% /shared
[ec2-user@ip-10-0-0-17 ~]$ qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
ip-10-0-0-136 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -
ip-10-0-0-154 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -
[ec2-user@ip-10-0-0-17 ~]$ qstat
[ec2-user@ip-10-0-0-17 ~]$
[ec2-user@ip-10-0-0-17 ~]$ ed hw.qsub
hw.qsub: No such file or directory
a
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -pe mpi 2
#$ -S /bin/bash
#
module load openmpi-x86_64
mpirun -np 2 hostname
.
w
110
q
[ec2-user@ip-10-0-0-17 ~]$ ll
total 4
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub
[ec2-user@ip-10-0-0-17 ~]$ qsub hw.qsub
Your job 1 ("hw.qsub") has been submitted
[ec2-user@ip-10-0-0-17 ~]$
[ec2-user@ip-10-0-0-17 ~]$ qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
------------------------------------------------------------------------------------------------
1 0.55500 hw.qsub ec2-user r 02/01/2015 05:57:25 all.q@ip-10-0-0-44.ap-southeas 2
[ec2-user@ip-10-0-0-17 ~]$ qstat
[ec2-user@ip-10-0-0-17 ~]$ ls -l
total 8
-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub
-rw-r--r-- 1 ec2-user ec2-user 26 Feb 1 05:57 hw.qsub.o1
[ec2-user@ip-10-0-0-17 ~]$ cat hw.qsub.o1
ip-10-0-0-136
ip-10-0-0-154
[ec2-user@ip-10-0-0-17 ~]$
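The job script typed into ed above can also be generated programmatically, which helps when the slot count varies between runs. This is a minimal sketch; the function name is hypothetical, and the comments give the usual meaning of each Grid Engine directive.

```python
def render_job_script(slots=2, command="hostname"):
    """Render a Grid Engine job script like the hw.qsub file above."""
    lines = [
        "#!/bin/bash",
        "#",
        "#$ -cwd",                  # run the job from the submission directory
        "#$ -j y",                  # merge stderr into the stdout file
        f"#$ -pe mpi {slots}",      # request slots in the 'mpi' parallel environment
        "#$ -S /bin/bash",          # shell used to interpret the script
        "#",
        "module load openmpi-x86_64",
        f"mpirun -np {slots} {command}",
    ]
    return "\n".join(lines) + "\n"


script = render_job_script()
print(script)
print(len(script.encode("ascii")))
```

With the default arguments the rendered script is 110 bytes, the same size ed reported after `w` in the session above.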
![Page 11: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/11.jpg)
Why do researchers love using AWS?
Time to Science: Access research infrastructure in minutes
Low Cost: Pay-as-you-go pricing
Elastic: Easily add or remove capacity
Globally Accessible: Easily collaborate with researchers around the world
Secure: A collection of tools to protect data and privacy
Scalable: Access to effectively limitless capacity
![Page 12: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/12.jpg)
The spherical model of owning a supercomputer…
[Figure: # CPUs vs. time, showing actual CPU usage. Annotation: “You’re still paying for this, but not using it.”]
![Page 13: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/13.jpg)
AWS (and the Spot Market)
[Figure: # CPUs vs. time]
Spot Market
Our ultimate space filler.
Spot Instances allow you to name your own price for spare AWS computing capacity.
Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).
![Page 14: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/14.jpg)
Time travel
[Figure: two panels, each plotting # CPUs vs. time]
Wall clock time: ~1 hour vs. wall clock time: ~1 week
Cost: equal
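The "cost: equal" claim follows from pay-per-use pricing: the bill depends on core-hours consumed, not on how they are spread over wall clock time. A minimal sketch, assuming perfect parallel scaling and a made-up price per core-hour (both are assumptions, not figures from the talk):

```python
def job_cost(cores, hours, price_per_core_hour):
    """Cost of a job billed purely by core-hours consumed."""
    return cores * hours * price_per_core_hour


PRICE = 0.05  # hypothetical price in dollars per core-hour

# One week is 168 hours, so both shapes below consume 168 core-hours.
wide = job_cost(cores=168, hours=1, price_per_core_hour=PRICE)
narrow = job_cost(cores=1, hours=168, price_per_core_hour=PRICE)

print(wide, narrow)
```

Real codes rarely scale perfectly, so in practice the wide run costs somewhat more; the point is that the difference is set by parallel efficiency rather than by the pricing model.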
![Page 15: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/15.jpg)
The solution
When you only pay for what you use …
• If you’re only able to use your compute, say, 30% of the time, you only pay for that time.
1. Pocket the savings
• Buy chocolate
• Buy a spectrometer
• Hire a scientist
2. Go faster
• Use 3x the cores to run your jobs at 3x the speed.
3. Go large
• Do 3x the science, or consume 3x the data.
… you have options.
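The "3x" figures are the rounded inverse of the 30% utilization mentioned above. As a rough sketch, assuming an owned cluster and the cloud cost the same per core-hour (an assumption made only to isolate the effect of utilization):

```python
def affordable_scale_factor(utilization):
    """How many times more cores the same budget buys when idle time is free."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization


factor = affordable_scale_factor(0.30)
print(round(factor, 2))
```

At 30% utilization the factor is about 3.3, which the slide rounds down to 3x.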
![Page 16: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/16.jpg)
2. WRF as HPC code on AWS
This morning’s precipitation forecast from mmm.ucar.edu
(you are here … )
![Page 17: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/17.jpg)
WRF as HPC code on AWS
NOT SEATTLE
![Page 18: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/18.jpg)
WRF as HPC code on AWS
• Weather forecasts
• Hurricane modeling
• Climate modeling
• Wildfire progression forecast
• …
• > 30,000 users in 150 countries
www.wrf-model.org
![Page 19: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/19.jpg)
WRF as HPC code on AWS
(From ucar.edu)
• Complex to run (multiscale)
• Complex to install!
• Vast range of computational requirements
![Page 20: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/20.jpg)
cfnCluster for WRF on AWS
• Define a ‘WRF cluster’ in config
  • compute_instance_type = c4.8xlarge
  • ebs_snapshot_id = snap-570ffb0e
  • max_queue_size = 20
• Scaling performance tests: good speedup to > 1,500 cores
• Ensemble runs: run them all in parallel (HTC), since your cluster expands elastically as needed
• Use “spot market” to save $
HPC nodes
Attach volume containing WRF installation and parameter sets
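A cluster template like this is just a named section of key/value pairs, so it can be written out with Python's standard configparser. This is a sketch only: the section name `[cluster wrf]` is an assumption, and a real cfncluster config may require the snapshot ID to live in a separate section rather than alongside the other keys.

```python
import configparser
import io


def wrf_cluster_template():
    """Render the three WRF settings from the slide as an INI section."""
    parser = configparser.ConfigParser()
    # Section name is assumed; the values are the ones quoted on the slide.
    parser["cluster wrf"] = {
        "compute_instance_type": "c4.8xlarge",
        "ebs_snapshot_id": "snap-570ffb0e",
        "max_queue_size": "20",
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


template_text = wrf_cluster_template()
print(template_text)
```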
![Page 21: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/21.jpg)
CfnCluster execution
• Create the compute cluster:
  laptop> cfncluster create --cluster-template wrf MyWRFCluster
  << wait 5-10 minutes >>
• Connect to the Master node of the compute cluster:
  laptop> ssh -i <mykey.pem> [email protected]
• Go to the working directory on the Master node and launch the WRF job:
  EC2> cd /codes/WRFV3.7/test/em_real
  EC2> mpirun -hostfile /home/ec2-user/hostfile -np 14 -ppn 2 ./wrf.exe
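In the mpirun line, `-np 14 -ppn 2` reads as 14 MPI ranks placed 2 per node, which implies 7 compute nodes in the hostfile. That reading of `-ppn` (processes per node) is an assumption about the MPI launcher in use. A small sketch that builds the same command and derives the node count:

```python
import shlex


def build_mpirun_command(hostfile, total_ranks, ranks_per_node, executable):
    """Return (nodes_needed, argv) for an mpirun launch like the one above."""
    if total_ranks % ranks_per_node != 0:
        raise ValueError("total_ranks must be a multiple of ranks_per_node")
    nodes = total_ranks // ranks_per_node
    argv = [
        "mpirun",
        "-hostfile", hostfile,
        "-np", str(total_ranks),
        "-ppn", str(ranks_per_node),
        executable,
    ]
    return nodes, argv


nodes, argv = build_mpirun_command("/home/ec2-user/hostfile", 14, 2, "./wrf.exe")
print(nodes)
print(shlex.join(argv))
```

Keeping the command as a list makes it safe to hand to `subprocess.run` later without shell quoting problems.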
![Page 22: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/22.jpg)
Using the AWS platform for WRF
• EC2 compute instances run the WRF workloads.
• WRF snapshot: AMI virtual machine image (or EBS disk volume or AWS Marketplace product or Docker ECS container …).
• CfnCluster: bundles AWS elements in a “WRF HPC cluster template”.
• Input/output data: S3 or EBS or on-prem ….
• Other AWS services to build new apps
![Page 23: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/23.jpg)
Ongoing pilot at UW
• Goal: understand and document performance and cost effectiveness of complete WRF workflows on AWS. Current pilot with U. Washington
• Prototypical workloads:
  - real-time simulations
  - research modeling using ensembles
  - regional climate models (large ensembles)
• “Spot instances” for cost savings
![Page 24: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/24.jpg)
Synergy with Open Data initiatives
• Access to input data is a significant concern for WRF users
• NEXRAD data archives in S3 with real-time updates going forward – fast and free access
• UNIDATA tools to be AWS capable soon – e.g. ldm
![Page 25: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/25.jpg)
3. Docker WRF at NCAR / RAL (publication in process)
November 12, 2015
John Exby ([email protected])
Josh Hacker ([email protected])
Dave Gill ([email protected])
UCAR/NCAR
Research Applications Laboratory / Foothills Lab
![Page 26: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/26.jpg)
Docker WRF at NCAR/RAL
NCAR Docker WRF Goals
• WRF precompiled binaries run under any docker engine!
• Allows scientific reproducibility and application portability from laptops to clouds (AWS).
• Develop stable platform for research, case studies, tutorials and classroom curricula.
• Constructed two case studies: 12 hour forecasts of Hurricane Sandy or Katrina (at 40km grid) initialized from NOAA global weather model.
• Beta group testing via (private) repositories hosted on hub.docker.com
![Page 27: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/27.jpg)
Docker WRF at NCAR/RAL
WRF in a box
Simple for new users to launch
![Page 28: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/28.jpg)
Docker WRF at NCAR/RAL
WRF on laptop: docker-machine, docker-compose
• Install docker toolbox v1.90
docker-machine create --driver virtualbox --virtualbox-cpu-count "2" \
  --virtualbox-memory "4096" --virtualbox-disk-size "14000" default
• $ docker login
• $ vi docker-compose.yml (defines container images, Sandy data set)
• $ docker-compose up (downloads container images from Hub and runs WRF)
Time for new local VM to instantiate: 1 minute
Full WRF output and graphics completed (MacBook Pro, 2 CPUs): 6 minutes 13 sec
![Page 29: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/29.jpg)
Docker WRF at NCAR/RAL
WRF on AWS using docker-machine, docker-compose
• Launch AWS EC2 instance via docker-machine on laptop:
docker-machine -D create --driver amazonec2 \
  --amazonec2-access-key $AWS_ACCESS_KEY_ID \
  --amazonec2-secret-key $AWS_SECRET_ACCESS_KEY \
  --amazonec2-vpc-id $AWS_VPC_ID \
  --amazonec2-region us-west-2 \
  --amazonec2-instance-type c4.4xlarge \
  --amazonec2-root-size 20 \
  --amazonec2-zone b wrf-Large16
• $ vi docker-compose.yml (defines container images, Sandy data set)
• $ docker-compose up (downloads container images from Hub to EC2 and runs WRF)
Time for new EC2 to instantiate: 3 minutes
Full WRF output results completed (c4.4xlarge, 16 CPUs): 2 min 53 sec
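The laptop and EC2 timings quoted on these two slides can be compared directly. The sketch below only does the arithmetic on the stated numbers; note that the laptop figure includes graphics while the EC2 figure is described as output results, so the comparison is approximate.

```python
def seconds(minutes, secs=0):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + secs


laptop_run = seconds(6, 13)   # MacBook Pro, 2 CPUs
ec2_run = seconds(2, 53)      # c4.4xlarge, 16 CPUs

# Ratio of run times, ignoring machine start-up.
speedup = laptop_run / ec2_run

# End-to-end times including the quoted instantiation times.
laptop_total = seconds(1) + laptop_run
ec2_total = seconds(3) + ec2_run

print(round(speedup, 2), laptop_total, ec2_total)
```

The run itself is a little over 2x faster on the 16-CPU instance than on the 2-CPU laptop, and the slower EC2 start-up narrows the end-to-end gap.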
![Page 30: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/30.jpg)
Docker WRF at NCAR/RAL
Hurricane Results in 7 minutes:
![Page 31: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/31.jpg)
If you are an educator …
• Check out AWS Educate!
• aws.amazon.com/educate
• AWS credits for instructors and students
• Platform for curriculum sharing
• …
![Page 33: High Performance Computing on AWS: cfnClusterand Weather … · 2018-12-28 · Scientific Computing –Amazon Web Services. AWS Cloud is a powerful platform for most science workloads,](https://reader033.fdocuments.us/reader033/viewer/2022042308/5ed50d79ab6e6a4761035599/html5/thumbnails/33.jpg)
Additional resources…
• aws.amazon.com/big-data
• aws.amazon.com/compliance
• aws.amazon.com/datasets
• aws.amazon.com/grants
• aws.amazon.com/genomics
• aws.amazon.com/hpc
• aws.amazon.com/security
• aws.amazon.com/scico
• aws.amazon.com/educate