High Performance Computing on AWS: cfnCluster and Weather … · 2018-12-28 · Scientific Computing...

High Performance Computing on AWS: cfnCluster and Weather Research & Forecasting (WRF) as Examples Kevin Jorissen - [email protected] Scientific Computing – Amazon Web Services


High Performance Computing on AWS: cfnCluster and Weather Research & Forecasting (WRF) as Examples

Kevin Jorissen - [email protected] Scientific Computing – Amazon Web Services


AWS Cloud is a powerful platform for most science workloads, from modest applications to large HPC and big data.

• What’s good for other industries is good for you, too
• Huge ecosystem of services beyond compute and storage
• Spend more time doing what only you can do: researchers doing science; IT solving your users’ challenges; teachers creating learning opportunities

AWS = more science for you


“The AWS model works when we have the greatest variety of uncoupled workloads all using the cloud. When it works, it drives the cost of computation down to trivial levels so people can concentrate more on their data, their science and their ideas, rather than bothering to worry about infrastructure.

Science is one of the greatest areas of computation and also happens to be the one that can most benefit from that democratization in cost and global accessibility and where we think Amazon can make a huge, really disruptive, impact on the world by participating - which is, at the most basic level, what we are about as a company.”

AWS = more science for you


AWS Cloud is a powerful platform for most science workloads, from modest applications to large HPC and big data.

• cfnCluster: create an HPC cluster in 10 minutes, with the flexibility of Cloud
• WRF: example of an HPC science workload on AWS
• WRF: example of a classroom experience supported by AWS

Topics


1. cfnCluster - provision an HPC cluster in minutes

http://aws.amazon.com/hpc/cfncluster/

10 minutes

• Created in minutes
• A parallel cluster with master, compute nodes, NFS shared disk, and job scheduler (sge, openlava, or torque)
• Choose compute nodes based on your needs
• Cluster sizes up and down depending on your work queue
• Stretches your research dollars further
• There’s no queue in the cloud
• Save any infrastructure configuration as a template


Config options to explore …

#cfncluster

Many options, but the most interesting ones immediately are:

# (defaults to t2.micro for default template)
compute_instance_type = t2.micro

# Master Server EC2 instance type
# (defaults to t2.micro for default template)
#master_instance_type = t2.micro

# Initial number of EC2 instances to launch as compute nodes in the cluster.
# (defaults to 2 for default template)
#initial_queue_size = 1

# Maximum number of EC2 instances that can be launched in the cluster.
# (defaults to 10 for the default template)
#max_queue_size = 10

# Boolean flag to set autoscaling group to maintain initial size and scale back
# (defaults to false for the default template)
#maintain_initial_size = true

# Cluster scheduler
# (defaults to sge for the default template)
scheduler = sge

# Type of cluster to launch i.e. ondemand or spot
# (defaults to ondemand for the default template)
#cluster_type = ondemand

# Spot price for the ComputeFleet
#spot_price = 0.00

# Cluster placement group. This placement group must already exist.
# (defaults to NONE for the default template)
#placement_group = NONE

Notes on these options:

• Instance types: t2.micro is tiny; c3.4xlarge might be more interesting …
• Queue sizes: min & max size of your cluster.
• maintain_initial_size: whether to fall back when things get quiet.
• scheduler: also can use ‘openlava’ or ‘torque’.
• cluster_type: explore the SPOT market if you want to save money :-)
• placement_group: a placement group will provision your instances very close to each other on the network.


System-wide Upgrade from Ivy Bridge to Haswell

#cfncluster

Yes, really :-)

$ ed ~/.cfncluster/config
/compute_instance_type/
compute_instance_type = c3.8xlarge
s/c3/c4/p
compute_instance_type = c4.8xlarge
w
949
$ cfncluster update boof-cluster

Downgrading is just as easy. Honest.
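For anyone who prefers not to drive ed by hand, the same one-line substitution can be sketched in Python. This is only an illustration: the function name and the sample config text are made up here, and the cluster still has to be updated afterwards with cfncluster update.

```python
import re


def swap_instance_family(config_text: str, old: str, new: str) -> str:
    """Rewrite only the compute_instance_type line, like ed's s/c3/c4/."""
    pattern = re.compile(
        r"^(compute_instance_type\s*=\s*)" + re.escape(old) + r"(?=\.)",
        flags=re.MULTILINE,
    )
    return pattern.sub(lambda match: match.group(1) + new, config_text)


# Hypothetical excerpt of ~/.cfncluster/config, for illustration only.
sample = (
    "compute_instance_type = c3.8xlarge\n"
    "#master_instance_type = t2.micro\n"
)

updated = swap_instance_family(sample, "c3", "c4")
print(updated, end="")
```

Calling swap_instance_family(updated, "c4", "c3") reverses the change, which is the "downgrade" path.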


Infrastructure as code

#cfncluster

The creation process might take a few minutes (maybe up to 5 mins or so, depending on how you configured it).

Because the API to CloudFormation (the service that does all the orchestration) is asynchronous, we could kill the terminal session if we wanted to and watch the whole show from the AWS console (where you’ll find it all under the “CloudFormation” dashboard, in the Events tab for this stack).

$ cfncluster create boof-cluster

Starting: boof-cluster

Status: cfncluster-boof-cluster - CREATE_COMPLETE

Output:"MasterPrivateIP"="10.0.0.17"

Output:"MasterPublicIP"="54.66.174.113"

Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"

Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"


Yes, it’s a real HPC cluster

#cfncluster

Now you have a cluster, probably running CentOS 6.x, with Sun Grid Engine as a default scheduler, and openMPI and a bunch of other stuff installed. You also have a shared filesystem in /shared and an autoscaling group ready to expand the number of compute nodes in the cluster when the existing ones get busy.

You can customize quite a lot via the .cfncluster/config file - check out the comments.

arthur ~ [26] $ cfncluster create boof-cluster

Starting: boof-cluster

Status: cfncluster-boof-cluster - CREATE_COMPLETE

Output:"MasterPrivateIP"="10.0.0.17"

Output:"MasterPublicIP"="54.66.174.113"

Output:"GangliaPrivateURL"="http://10.0.0.17/ganglia/"

Output:"GangliaPublicURL"="http://54.66.174.113/ganglia/"

arthur ~ [27] $ ssh [email protected]

The authenticity of host '54.66.174.113 (54.66.174.113)' can't be established.

RSA key fingerprint is 45:3e:17:76:1d:01:13:d8:d4:40:1a:74:91:77:73:31.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added '54.66.174.113' (RSA) to the list of known hosts.

[ec2-user@ip-10-0-0-17 ~]$ df

Filesystem 1K-blocks Used Available Use% Mounted on

/dev/xvda1 10185764 7022736 2639040 73% /

tmpfs 509312 0 509312 0% /dev/shm

/dev/xvdf 20961280 32928 20928352 1% /shared

[ec2-user@ip-10-0-0-17 ~]$ qhost

HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS

----------------------------------------------------------------------------------------------

global - - - - - - - - - -

ip-10-0-0-136 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -

ip-10-0-0-154 lx-amd64 8 1 4 8 - 14.6G - 1024.0M -

[ec2-user@ip-10-0-0-17 ~]$ qstat

[ec2-user@ip-10-0-0-17 ~]$

[ec2-user@ip-10-0-0-17 ~]$ ed hw.qsub

hw.qsub: No such file or directory

a

#!/bin/bash

#

#$ -cwd

#$ -j y

#$ -pe mpi 2

#$ -S /bin/bash

#

module load openmpi-x86_64

mpirun -np 2 hostname

.

w

110

q

[ec2-user@ip-10-0-0-17 ~]$ ll

total 4

-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub

[ec2-user@ip-10-0-0-17 ~]$ qsub hw.qsub

Your job 1 ("hw.qsub") has been submitted

[ec2-user@ip-10-0-0-17 ~]$

[ec2-user@ip-10-0-0-17 ~]$ qstat

job-ID prior name user state submit/start at queue slots ja-task-ID

------------------------------------------------------------------------------------------------

1 0.55500 hw.qsub ec2-user r 02/01/2015 05:57:25 all.q@ip-10-0-0-44.ap-southeas 2

[ec2-user@ip-10-0-0-17 ~]$ qstat

[ec2-user@ip-10-0-0-17 ~]$ ls -l

total 8

-rw-rw-r-- 1 ec2-user ec2-user 110 Feb 1 05:57 hw.qsub

-rw-r--r-- 1 ec2-user ec2-user 26 Feb 1 05:57 hw.qsub.o1

[ec2-user@ip-10-0-0-17 ~]$ cat hw.qsub.o1

ip-10-0-0-136

ip-10-0-0-154

[ec2-user@ip-10-0-0-17 ~]$
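The job script typed into ed above can also be generated for any slot count. The sketch below is an illustration only (render_sge_job is a name invented here); it reproduces the same directives and, for two slots, the same 110 bytes that ed reported when writing hw.qsub.

```python
def render_sge_job(slots: int, command: str = "hostname") -> str:
    """Return a Grid Engine job script mirroring hw.qsub from the session above."""
    lines = [
        "#!/bin/bash",
        "#",
        "#$ -cwd",              # run from the submission directory
        "#$ -j y",              # merge stderr into stdout
        f"#$ -pe mpi {slots}",  # request slots in the 'mpi' parallel environment
        "#$ -S /bin/bash",
        "#",
        "module load openmpi-x86_64",
        f"mpirun -np {slots} {command}",
    ]
    return "\n".join(lines) + "\n"


script = render_sge_job(2)
print(script, end="")
print(len(script.encode("ascii")), "bytes")
```

Writing the result to a file and passing it to qsub would follow the same path as the session above.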


Why do researchers love using AWS?

Time to Science: Access research infrastructure in minutes

Low Cost: Pay-as-you-go pricing

Elastic: Easily add or remove capacity

Globally Accessible: Easily collaborate with researchers around the world

Secure: A collection of tools to protect data and privacy

Scalable: Access to effectively limitless capacity


The spherical model of owning a supercomputer…

[Figure: # CPUs vs. time, showing actual CPU usage. Annotation: “You’re still paying for this, but not using it.”]


AWS (and the Spot Market)

[Figure: # CPUs vs. time. Label: “Spot Market: our ultimate space filler.”]

Spot Instances allow you to name your own price for spare AWS computing capacity.

Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).


Time travel

[Figure: two plots of # CPUs vs. time. One run has a wall clock time of ~1 hour, the other ~1 week.]

Cost: equal
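The arithmetic behind "cost: equal" can be sketched as follows. The price per core-hour is a made-up figure, and the sketch assumes the job scales perfectly across cores, which real codes only approximate.

```python
import math


def run_cost(cores: int, hours: float, price_per_core_hour: float) -> float:
    """Pay-as-you-go cost: you are billed for core-hours, however they are arranged."""
    return cores * hours * price_per_core_hour


PRICE = 0.05  # hypothetical USD per core-hour, not an AWS quote

one_week = run_cost(cores=1, hours=168, price_per_core_hour=PRICE)
one_hour = run_cost(cores=168, hours=1, price_per_core_hour=PRICE)

print(f"1 core for ~1 week:   ${one_week:.2f}")
print(f"168 cores for ~1 hour: ${one_hour:.2f}")
```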


The solution

When you only pay for what you use …

• If you’re only able to use your compute, say, 30% of the time, you only pay for that time.

1. Pocket the savings

• Buy chocolate
• Buy a spectrometer
• Hire a scientist

2. Go faster

• Use 3x the cores to run your jobs at 3x the speed.

3. Go large

• Do 3x the science, or consume 3x the data.

… you have options.
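As a rough illustration of where the "3x" comes from, the sketch below takes the 30% utilization figure from this slide and a made-up annual cost for an owned machine. It simplifies heavily by assuming cloud and owned capacity cost the same per hour.

```python
def pay_per_use_options(owned_cost: float, utilization: float) -> dict:
    """Compare paying for all capacity with paying only for the hours used."""
    used_cost = owned_cost * utilization
    return {
        "savings": owned_cost - used_cost,   # option 1: pocket the savings
        "scale_factor": 1.0 / utilization,   # options 2 and 3: same budget, more cores or more work
    }


# 100,000 is a hypothetical annual cost; 0.30 is the utilization quoted above.
options = pay_per_use_options(owned_cost=100_000, utilization=0.30)
print(f"savings: {options['savings']:.0f}")
print(f"scale factor: {options['scale_factor']:.2f}x")
```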


2. WRF as HPC code on AWS

This morning’s precipitation forecast from mmm.ucar.edu

(you are here … )


WRF as HPC code on AWS

NOT SEATTLE


WRF as HPC code on AWS

• Weather forecasts
• Hurricane modeling
• Climate modeling
• Wildfire progression forecast
• …
• > 30,000 users in 150 countries

www.wrf-model.org


WRF as HPC code on AWS

(From ucar.edu)

• Complex to run (multiscale)
• Complex to install!
• Vast range of computational requirements


cfnCluster for WRF on AWS

• Define a ‘WRF cluster’ in config:
  • compute_instance_type = c4.8xlarge
  • ebs_snapshot_id = snap-570ffb0e
  • max_queue_size = 20

• Scaling performance tests: good speedup to > 1,500 cores

• Ensemble runs: run them all in parallel (HTC), since your cluster expands elastically as needed

• Use “spot market” to save $

[Figure: HPC nodes with an attached volume containing the WRF installation and parameter sets]
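The three settings listed above could be written out as a named cluster template with a few lines of Python. This is a sketch under assumptions: the section name "cluster wrf" is a guess at how a template called wrf would be laid out, and the real cfncluster config file may expect some keys (the EBS snapshot in particular) in a separate section.

```python
import configparser
import io


def wrf_cluster_section(name: str = "wrf") -> str:
    """Render the WRF settings from the slide as one INI section."""
    config = configparser.ConfigParser()
    section = f"cluster {name}"  # assumed naming, not confirmed by the slides
    config[section] = {
        "compute_instance_type": "c4.8xlarge",
        "ebs_snapshot_id": "snap-570ffb0e",
        "max_queue_size": "20",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


text = wrf_cluster_section()
print(text, end="")
```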


CfnCluster execution

• Create the compute cluster:
  • laptop> cfncluster create --cluster-template wrf MyWRFCluster
  • << wait 5-10 minutes >>
• Connect to the Master node of the compute cluster:
  • laptop> ssh -i <mykey.pem> [email protected]
• Go to the working directory on the Master node and launch the WRF job:
  • EC2> cd /codes/WRFV3.7/test/em_real
  • EC2> mpirun -hostfile /home/ec2-user/hostfile -np 14 -ppn 2 ./wrf.exe
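The mpirun line above asks for 14 ranks at 2 ranks per node, which implies 7 compute nodes in the hostfile. A small sketch (the helper name is invented here) makes that relationship explicit and rebuilds the same command line:

```python
import shlex


def mpirun_command(total_ranks: int, ranks_per_node: int,
                   hostfile: str, executable: str) -> tuple[int, str]:
    """Return the node count implied by -np/-ppn and the matching command line."""
    if total_ranks % ranks_per_node != 0:
        raise ValueError("total ranks must be a multiple of ranks per node")
    nodes = total_ranks // ranks_per_node
    argv = [
        "mpirun", "-hostfile", hostfile,
        "-np", str(total_ranks),
        "-ppn", str(ranks_per_node),
        executable,
    ]
    return nodes, shlex.join(argv)


nodes, command = mpirun_command(14, 2, "/home/ec2-user/hostfile", "./wrf.exe")
print(nodes, "nodes")
print(command)
```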


Using the AWS platform for WRF

• EC2 compute instances run the WRF workloads.
• WRF snapshot: AMI virtual machine image (or EBS disk volume or AWS Marketplace product or Docker ECS container …).
• CfnCluster: bundles AWS elements in a “WRF HPC cluster template”.
• Input/output data: S3 or EBS or on-prem ….
• Other AWS services to build new apps


Ongoing pilot at UW

• Goal: understand and document performance and cost effectiveness of complete WRF workflows on AWS. Current pilot with U. Washington

• Prototypical workloads:
  - real-time simulations
  - research modeling using ensembles
  - regional climate models (large ensembles)

• “Spot instances” for cost savings


Synergy with Open Data initiatives

• Access to input data is a significant aspect for WRF users
• NEXRAD data archives in S3 with real-time updates going forward – fast and free access
• UNIDATA tools to be AWS capable soon – e.g. ldm


3. Docker WRF at NCAR / RAL (publication in process)

November 12, 2015

John Exby ([email protected])

Josh Hacker ([email protected])

Dave Gill ([email protected])

UCAR/NCAR Research Applications Laboratory / Foothills Lab


Docker WRF at NCAR/RAL

NCAR Docker WRF Goals

• WRF precompiled binaries run under any docker engine!

• Allows scientific reproducibility and application portability from laptops to clouds (AWS).

• Develop stable platform for research, case studies, tutorials and classroom curricula.

• Constructed two case studies: 12 hour forecasts of Hurricane Sandy or Katrina (at 40km grid) initialized from NOAA global weather model.

• Beta group testing via (private) repositories hosted on hub.docker.com


Docker WRF at NCAR/RAL

WRF in a box

Simple for new users to launch


Docker WRF at NCAR/RAL

WRF on laptop: docker-machine, docker-compose

• Install docker toolbox v1.90

docker-machine create --driver virtualbox --virtualbox-cpu-count "2" \
  --virtualbox-memory "4096" --virtualbox-disk-size "14000" default

• $ docker login
• $ vi docker-compose.yml (defines container images, Sandy data set)
• $ docker-compose up (downloads container images from Hub and runs WRF)

Time for new local VM to instantiate: 1 minute
Full WRF output and graphics completed (MacBook Pro, 2 CPUs): 6 minutes 13 seconds


Docker WRF at NCAR/RAL

WRF on AWS using docker-machine, docker-compose

• Launch AWS EC2 instance via docker-machine on laptop:

docker-machine -D create --driver amazonec2 \
  --amazonec2-access-key $AWS_ACCESS_KEY_ID \
  --amazonec2-secret-key $AWS_SECRET_ACCESS_KEY \
  --amazonec2-vpc-id $AWS_VPC_ID \
  --amazonec2-region us-west-2 \
  --amazonec2-instance-type c4.4xlarge \
  --amazonec2-root-size 20 \
  --amazonec2-zone b wrf-Large16

• $ vi docker-compose.yml (defines container images, Sandy data set)
• $ docker-compose up (downloads container images from Hub to EC2 and runs WRF)

Time for new EC2 to instantiate: 3 minutes
Full WRF output results completed (c4.4xlarge, 16 CPUs): 2 minutes 53 seconds
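Putting the laptop and EC2 timings quoted in these slides side by side takes only a little arithmetic. The sketch below uses just those four numbers; the labels and variable names are invented for illustration, and a single run on each platform is not a benchmark.

```python
def seconds(minutes: int, secs: int = 0) -> int:
    """Convert a minutes/seconds timing to seconds."""
    return minutes * 60 + secs


# Timings as quoted on the laptop and EC2 slides.
runs = {
    "laptop (2 cpu)": {"startup": seconds(1), "wrf": seconds(6, 13)},
    "c4.4xlarge (16 cpu)": {"startup": seconds(3), "wrf": seconds(2, 53)},
}

totals = {name: t["startup"] + t["wrf"] for name, t in runs.items()}
wrf_speedup = runs["laptop (2 cpu)"]["wrf"] / runs["c4.4xlarge (16 cpu)"]["wrf"]

for name, total in totals.items():
    print(f"{name}: {total // 60} min {total % 60} s end to end")
print(f"WRF run speedup on EC2: {wrf_speedup:.2f}x")
```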


Docker WRF at NCAR/RAL

Hurricane Results in 7 minutes:


If you are an educator …

• Check out AWS Educate!
• aws.amazon.com/educate
• AWS credits for instructors and students
• Platform for curriculum sharing
• …


Thank you!

Kevin Jorissen

[email protected]


Additional resources…

• aws.amazon.com/big-data
• aws.amazon.com/compliance
• aws.amazon.com/datasets
• aws.amazon.com/grants
• aws.amazon.com/genomics
• aws.amazon.com/hpc
• aws.amazon.com/security
• aws.amazon.com/scico
• aws.amazon.com/educate