
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Adam Hunter,

Solutions Architect, Public Sector ANZ

ahuntera@amazon.com

28th August 2019

Focus on the Science, Not the Server
Pawsey Supercomputing & HPC-AI Advisory Council


IT’S ABOUT SCIENCE, NOT SERVERS.

#AWSresearchcloud

aws.amazon.com/rcp

[Chart: Cores vs. Time (days), contrasting elastic cloud capacity with a fixed data centre capacity limit]
* Source: Hyperion Research, 2018

The metric for success should be time-to-results


1.1M vCPUs for machine learning

A group of researchers from Clemson University achieved a remarkable milestone while studying topic modeling, an important component of machine learning associated with natural language processing: they broke the record for the largest high-performance cluster in the cloud, using more than 1,100,000 vCPUs on Amazon EC2 Spot Instances running in a single AWS Region.

The graph highlights the elastic, automatic expansion of resources. Clemson took advantage of the new per-second billing for Amazon EC2 instances. The vCPU usage is comparable to the core count of the largest supercomputers in the world.

[Architecture diagram: a job script is submitted through CloudyCluster APIs to a login/scheduler node running SLURM with CCQ; Auto Scaling and Spot Fleet provision instances inside an Amazon VPC, with Amazon S3 and DynamoDB backing the provisioning and workflow automation software]


Cost advantages

On-Premises (Capital Expense Model):

▪ High upfront capital cost

▪ High cost of ongoing support

Amazon Web Services (AWS) (Pay As You Go Model):

▪ Use only what you need

▪ Multiple pricing models


Cost Optimization

Spot: Up to 90% savings by using excess capacity, charged at a Spot price that fluctuates based on supply and demand. For high-scale, time-flexible workloads.

Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.

On-Demand: Pay for compute capacity by the hour, with per-second billing and no long-term commitments. For spiky workloads, or to define needs.

[Diagrams: three instance-mix strategies ("Conservative", "Optimized", and "Optimized with scale-out (magnify the peak)"), each layering Spot capacity on top of On-Demand and Reserved Instances]
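Spot capacity can be requested straight from the EC2 API. A minimal, hypothetical sketch (the AMI ID and instance count are placeholders):

$ aws ec2 run-instances \
    --image-id ami-XXXXXXXXXXXXXXXXX \
    --instance-type c5n.18xlarge \
    --count 10 \
    --instance-market-options 'MarketType=spot'

If Spot capacity is reclaimed, a fault-tolerant scheduler can simply requeue the interrupted jobs.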


HPC on AWS

A virtually unlimited number of architecture and deployment options to meet the demands of both your users and your applications.


All Sorts of HPC Workloads in the Cloud

▪ Life Sciences

▪ Financial Services

▪ Energy & Geo Sciences

▪ Design & Engineering

▪ Media & Entertainment

▪ Autonomous Vehicles


Compute: Amazon EC2 C5n


C5n Instances with 100 Gbps Networking

▪ The first “network optimized” instances on AWS

▪ Intel Skylake CPUs

▪ Nitro System (hypervisor and ENA)

▪ Available with EFA


Launching C5n: [Console screenshots: Choose Instance Type; Configure Instance]


Network: Elastic Fabric Adapter


What is Elastic Fabric Adapter (EFA)?

Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2, best for large HPC workloads and available on C5n and P3dn instances.

Scale tightly-coupled HPC applications on AWS


HPC software stack in Amazon Elastic Compute Cloud (Amazon EC2)

[Diagram: the HPC software stack split into userspace and kernel layers, shown without EFA and with EFA]


What can EFA do?


{
  "LaunchTemplateName": "EFA",
  "LaunchTemplateData": {
    "CpuOptions": {
      "CoreCount": 36,
      "ThreadsPerCore": 1
    },
    "ImageId": "ami-XXXXXXXXXXXXXXXXX",
    "Placement": {
      "GroupName": "cfdinA"
    },
    "InstanceType": "c5n.18xlarge",
    "NetworkInterfaces": [
      {
        "DeviceIndex": 0,
        "SubnetId": "subnet-XXXXXXXX",
        "InterfaceType": "efa",
        "Groups": [
          "sg-b2a50ad6"
        ]
      }
    ]
  }
}
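The JSON above is in the shape accepted by the EC2 CreateLaunchTemplate API, so, as a hedged sketch (the file name is a placeholder, and the placement group referenced in the template must exist first), it can be registered and used from the AWS CLI:

$ aws ec2 create-placement-group --group-name cfdinA --strategy cluster
$ aws ec2 create-launch-template --cli-input-json file://efa-template.json
$ aws ec2 run-instances --launch-template LaunchTemplateName=EFA --count 4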

Getting Started with EFA

Step 1: Prepare an EFA-enabled Security Group
Step 2: Launch a Temporary Instance
Step 3: Install EFA Software Components (sketched below)
Step 4: Install your HPC Application
Step 5: Create an EFA-enabled AMI
Step 6: Launch EFA-enabled Instances into a Cluster Placement Group
Step 7: Terminate the Temporary Instance
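A minimal sketch of Step 3, assuming an Amazon Linux build instance; the installer bundle is the one published in the EFA documentation:

$ curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
$ tar -xf aws-efa-installer-latest.tar.gz && cd aws-efa-installer
$ sudo ./efa_installer.sh -y
$ fi_info -p efa

The last command verifies that the libfabric EFA provider is visible to userspace applications.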


Storage: Amazon FSx for Lustre


Amazon FSx for Lustre

Fully managed third-party file systems optimized for a variety of workloads: fully managed, cost-effective, and high-performing.

A parallel distributed file system with massively scalable performance:
• 100+ GiB/s throughput
• Millions of IOPS
• Consistent low latencies
• Lustre, an open-source parallel file system


File system throughput and IOPS scale linearly with storage capacity.

Each terabyte (TB) of storage provides 200 MB/second of file system throughput; a 50 TB file system, for example, delivers roughly 10 GB/second in aggregate.

File systems can scale to hundreds of GB/s and millions of IOPS.


Seamless integration with Amazon S3

Link your Amazon S3 data set to your Amazon FSx for Lustre file system, then:

• Data stored in Amazon S3 is loaded to Amazon FSx for processing

• Output of processing is returned to Amazon S3 for retention

When your workload finishes, simply delete your file system.
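As a hedged sketch (bucket name, subnet, and capacity are placeholders), the link is made at file-system creation time: ImportPath points at the data set, ExportPath receives results:

$ aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 3600 \
    --subnet-ids subnet-XXXXXXXX \
    --lustre-configuration ImportPath=s3://my-dataset-bucket,ExportPath=s3://my-dataset-bucket/results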


Getting Started with FSx Lustre

[Steps 1-4: console screenshots]

Step 5: sudo lfs hsm_archive filename
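A minimal sketch of mounting the file system and exporting a result, assuming Amazon Linux 2 and a placeholder file system DNS name:

$ sudo amazon-linux-extras install -y lustre2.10
$ sudo mkdir -p /fsx
$ sudo mount -t lustre fs-XXXXXXXXXXXXXXXXX.fsx.ap-southeast-2.amazonaws.com@tcp:/fsx /fsx
$ sudo lfs hsm_archive /fsx/filename
$ sudo lfs hsm_state /fsx/filename

hsm_archive writes the file back to the linked S3 bucket; hsm_state reports export progress.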


Orchestration: AWS ParallelCluster


AWS ParallelCluster is now a supported open-source tool!

Simplifies the deployment of HPC in the cloud:

▪ SLURM

▪ Grid Engine

▪ Torque

Incorporates the best of AWS HPC services:

▪ FSx Lustre

▪ EFA

▪ AWS Batch


Many new features with AWS ParallelCluster

• Improved cluster scale-up and scale-down

• File system support, including: FSx Lustre, RAID, multiple NFS shares, EFS

• Regional expansion: all standard Regions, China, and GovCloud

• Improved custom AMI support


Getting started with AWS ParallelCluster

$ sudo pip install aws-parallelcluster

$ pcluster configure

$ pcluster create cfdcluster

$ pcluster ssh cfdcluster

$ pcluster list

$ pcluster delete cfdcluster
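pcluster configure writes the cluster definition to ~/.parallelcluster/config. A minimal sketch (region, key pair, and VPC/subnet IDs are placeholders) wiring together SLURM, EFA, and FSx Lustre might look like:

[aws]
aws_region_name = ap-southeast-2

[global]
cluster_template = cfdcluster

[cluster cfdcluster]
key_name = my-keypair
base_os = alinux
scheduler = slurm
master_instance_type = c5.xlarge
compute_instance_type = c5n.18xlarge
# cluster placement group plus EFA on the compute fleet
placement_group = DYNAMIC
enable_efa = compute
initial_queue_size = 0
max_queue_size = 16
fsx_settings = fs
vpc_settings = public

[fsx fs]
shared_dir = /fsx
storage_capacity = 3600

[vpc public]
vpc_id = vpc-XXXXXXXX
master_subnet_id = subnet-XXXXXXXX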

Try it with:

https://docs.aws.amazon.com/parallelcluster/

https://github.com/aws/aws-parallelcluster

[Diagram: the pcluster CLI driving the cluster from a client machine or AWS Cloud9]


AWS Partners: Ronin


https://ronin.cloud/


Operations: AWS Systems Manager


Secure Operations with AWS Systems Manager

[Diagram: a shell or CLI session passes through IAM access control to Run Command and Session Manager, reaching compute nodes inside a VPC]

Available in all AWS Regions including AWS GovCloud (US)

Session Manager

• Browser-based shell and CLI for EC2 instances

• No need to open inbound ports or manage SSH keys

• Grant access through IAM

• Session auditing and logging

• Support for AWS PrivateLink


Getting Started with Session Manager (AWS Systems Manager)
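As a minimal sketch (the instance ID is a placeholder; the target needs the SSM agent running and an IAM instance profile that allows Systems Manager), a shell opens with:

$ aws ssm start-session --target i-0123456789abcdef0

No bastion host, open port 22, or SSH key is involved; access is governed by IAM, and sessions can be audited and logged.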


Performance Considerations


Performance Considerations for HPC on AWS

• Use real-world test cases

• Disable Hyper-Threading

• Bind processes to cores (both are sketched below)

• Launch into a placement group

• Use the default clock source on c5/c5n

• Use an up-to-date OS

• Compile for the host (AVX-512)
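A minimal, hypothetical sketch of the first two tuning steps (the AMI and placement group name are placeholders; the mpirun flags are Open MPI's and vary by MPI implementation):

# one thread per physical core turns off Hyper-Threading at launch
$ aws ec2 run-instances \
    --image-id ami-XXXXXXXXXXXXXXXXX \
    --instance-type c5n.18xlarge \
    --cpu-options CoreCount=36,ThreadsPerCore=1 \
    --placement GroupName=cfdinA

# pin one MPI rank per core and print the resulting bindings
$ mpirun --bind-to core --report-bindings -np 36 ./solver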


AWS Researcher’s handbook

Written by Amazon's Research Computing community for scientists.

• Explains foundational concepts about how AWS can accelerate time-to-science in the cloud

• Step-by-step best practices for securing your environment to ensure your research data is safe and your privacy is protected

• Tools for budget management that help you control your spending, limit costs, and prevent overruns

• Catalogue of scientific solutions from partners chosen for their outstanding work with scientists

aws.amazon.com/rcp



Adam Hunter

Solutions Architect, Public Sector

ahuntera@amazon.com

https://aws.amazon.com/hpc/


Fluid dynamics – Ansys Fluent

• c4.8xlarge instance type

• 140M cell model

• F1 car CFD benchmark


Scaling Benchmarks


Resources used in this study

Archer: Cray XC30 supercomputer
• Two 2.7 GHz, 12-core Intel E5-2697 v2 (Ivy Bridge) CPUs per node

AWS:
• z1d: 4.0 GHz Intel® Xeon® Scalable processors; 24 cores per instance; 16 GB RAM per core; 25 Gigabit network bandwidth
• c5n: 3.0/3.5 GHz Intel® Xeon® Scalable processors; 36 cores per instance; 5.3 GB RAM per core; 100 Gigabit network bandwidth; new Elastic Fabric Adapter (EFA) for fast networking


Methodology

• OpenFOAM v1806 in double precision (pimpleFoam)

• Scotch decomposition for solving; hierarchical (i.e. constant x/y/z loading) for meshing

• SST-DDES turbulence model

• ANSA-generated 143M/280M cell unstructured meshes

• Time step = 5e-4 s with five inner iterations

• Preconditioned Conjugate Gradient linear solver


Scaling Results

[Plot: seconds per time-step vs. cells per core for z1d, c5n, and Archer, each on the medium and fine meshes]


Acknowledgements

Dr. Neil Ashton of Oxford University

Stephen Sachs of AWS