Marcin Okoń, Poznań Supercomputing and Networking Center, Supercomputing Department
Self-Service Supercomputing
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
London Summit July 2016
HPC Clusters as code in the [almost]* Infinite cloud
Brendan Bouffler, AWS Global Scientific Computing
@boofla
2016-07-07
Wil Mayers, Alces Flight Ltd (UK)
@alcesflight
Scientific Computing
Science is one of the greatest areas of computation and can benefit from a democratization in cost and global accessibility that the cloud brings.
It’s also where we think Amazon can make a huge, really disruptive impact on the world by participating, which is, at the most basic level, what we are about as a company.
Disrupting science, wherever it’s happening.
Existing regions: Oregon, California, Virginia, Dublin, Frankfurt, Singapore, Sydney, Seoul, Tokyo, Sao Paulo, Beijing, US GovCloud
Coming soon: Ohio, India, UK, Canada, China +1
(Map legend: AWS Region, Availability Zone)
Regions are sovereign: your data never leaves the region you put it in.
Public Data Sets
Bring workloads to the data, not data to the workloads.
Meeeeelions of uncorrelated workloads
(Chart: cores over time)
Collective action
When everyone comes together in the cloud to share the resource, and only pays for what they use, the efficiency is huge.
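The efficiency claim rests on statistical multiplexing: the peak of combined demand is never larger than the sum of each user's individual peak, and for uncorrelated workloads it is usually much smaller. Below is a minimal Python sketch, assuming a made-up bursty demand model (each user needs 100 cores in any given hour with 10% probability); the model and the `capacity_needed` helper are illustrative, not AWS data.

```python
import random


def capacity_needed(num_users: int, hours: int, burst: int = 100,
                    p_burst: float = 0.1, seed: int = 0) -> tuple[int, int]:
    """Return (sum of per-user peaks, peak of pooled demand), in cores.

    Demand is hypothetical: each hour, each user independently needs `burst`
    cores with probability `p_burst`, otherwise nothing.
    """
    rng = random.Random(seed)
    users = [[burst if rng.random() < p_burst else 0 for _ in range(hours)]
             for _ in range(num_users)]
    # Everyone buys capacity for their own peak.
    dedicated = sum(max(u) for u in users)
    # One shared pool only has to cover the busiest hour of the combined demand.
    pooled = max(sum(u[h] for u in users) for h in range(hours))
    return dedicated, pooled


dedicated, pooled = capacity_needed(num_users=50, hours=168)
print(f"dedicated: {dedicated} cores, pooled: {pooled} cores")
```

With 50 such users over a week, the shared pool is sized for one combined peak rather than 50 separate ones.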
Spot Market
(Chart: cores over time)
Our ultimate space filler.
Spot Instances allow you to name your own price for spare AWS EC2 computing capacity.
Great for workloads that aren’t time sensitive, and especially popular in research (hint: it’s really cheap).
Spot Market Behavior
Spot Bid Advisor
The Spot Bid Advisor analyzes Spot price history to help you determine a bid price that suits your needs.
You should weigh your application’s tolerance for interruption and your cost saving goals when selecting a Spot instance and bid price.
The lower your frequency of being outbid, the longer your Spot instances are likely to run without interruption.
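One way to read "frequency of being outbid" is the fraction of past hours in which the Spot price sat above a given bid. The Python sketch below computes that from a hypothetical price history; the prices and the `outbid_fraction` helper are illustrative assumptions, and this is not how the Spot Bid Advisor itself is implemented.

```python
def outbid_fraction(prices: list[float], bid: float) -> float:
    """Fraction of observed hours in which the Spot price exceeded the bid."""
    if not prices:
        raise ValueError("need at least one price sample")
    return sum(1 for p in prices if p > bid) / len(prices)


# Hypothetical hourly Spot price history in $/hour (not real AWS data).
history = [0.30, 0.32, 0.31, 0.45, 0.33, 0.29, 0.60, 0.31, 0.30, 0.34]

for bid in (0.31, 0.35, 0.50, 0.70):
    print(f"bid ${bid:.2f}: outbid {outbid_fraction(history, bid):.0%} of the time")
```

Raising the bid from $0.31 to $0.35 cuts the outbid fraction in this made-up history from half the hours to a fifth.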
https://aws.amazon.com/ec2/spot/bid-advisor/
Bid Price & Savings
Your bid price affects your ranking when it comes to acquiring resources in the Spot market, and is the maximum price you will pay.
But frequently you’ll pay a lot less.
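Under the bidding model described here, the bid is only a ceiling: while the market price stays at or below it you are charged the market price, and once the price rises above it the instance is interrupted. A simplified hourly Python sketch, assuming a hypothetical price series and ignoring partial-hour billing rules; `run_spot` is an illustrative helper.

```python
def run_spot(prices: list[float], bid: float) -> tuple[int, float]:
    """Simulate hourly Spot billing: pay the market price, never more than the bid.

    Returns (hours run before interruption, total cost in $).
    """
    hours, cost = 0, 0.0
    for price in prices:
        if price > bid:   # outbid: the instance is interrupted
            break
        cost += price     # charged the market price, not the bid
        hours += 1
    return hours, cost


prices = [0.30, 0.32, 0.31, 0.45, 0.33]  # hypothetical $/hour
hours, cost = run_spot(prices, bid=1.00)
print(hours, round(cost, 2), "vs", round(hours * 1.00, 2), "at the bid price")
```

Bidding $1.00 against these prices costs $1.71 for five hours rather than $5.00; bidding $0.40 instead stops the run when the price reaches $0.45.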
Agility is… Paying Only for IT You Use
Peak: 58K cores
Valley: 12K cores
Breakthrough discoveries in the Cloud
The CHILES project astronomers have detected radio emissions from hydrogen in a galaxy more than 5 billion light years away, almost doubling the previous distance record. This has important implications for our understanding of how galaxies have evolved over time.
The team at ICRAR in Western Australia estimates that the compute capacity required to shift and crunch this data would otherwise have made this work infeasible.
By using AWS, they were able to quickly and cheaply build their new pipelines, and then scale them as massive amounts of data arrived from their instruments.
Science is about experimentation
AWS Building blocks
• Technical & Business Support: Account Management, Support, Professional Services, Solutions Architects, Training & Certification, Security & Pricing Reports, Partner Ecosystem
• AWS Marketplace: Backup, Big Data & HPC, Business Apps, Databases, Development, Industry Solutions, Security
• Management Tools: Queuing, Notifications, Search, Orchestration
• Enterprise Apps: Virtual Desktops, Storage Gateway, Sharing & Collaboration, Email & Calendaring, Directories
• Hybrid Cloud Management: Backups, Deployment, Direct Connect, Identity Federation, Integrated Management
• Security & Management: Virtual Private Networks, Identity & Access, Encryption Keys, Configuration, Monitoring, Dedicated
• Infrastructure Services: Regions, Availability Zones, Compute, Storage (Objects, Blocks, Files), Databases (SQL, NoSQL, Caching), CDN, Networking
• Platform Services:
  • App: Mobile & Web Front-end, Functions, Identity, Data Store, Real-time
  • Development: Containers, Source Code, Build Tools, Deployment, DevOps
  • Mobile: Sync, Identity, Push Notifications, Mobile Analytics, Mobile Backend
  • Analytics: Data Warehousing, Hadoop, Streaming, Data Pipelines, Machine Learning
EC2: There are a couple of dozen EC2 compute instance types alone, each optimized for different things.
One size does not fit all.
C4: Intel Xeon E5-2666 v3, custom built for AWS.
Intel Haswell, 16 double-precision FLOPs per cycle per core
2.9 GHz, turbo to 3.5 GHz
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/c4-instances.html
Feature | Specification
Processor Number | E5-2666 v3
Intel® Smart Cache | 25 MiB
Instruction Set | 64-bit
Instruction Set Extensions | AVX 2.0
Lithography | 22 nm
Processor Base Frequency | 2.9 GHz
Max All Core Turbo Frequency | 3.2 GHz
Max Turbo Frequency | 3.5 GHz (available on c4.2xLarge)
Intel® Turbo Boost Technology | 2.0
Intel® vPro Technology | Yes
Intel® Hyper-Threading Technology | Yes
Intel® Virtualization Technology (VT-x) | Yes
Intel® Virtualization Technology for Directed I/O (VT-d) | Yes
Intel® VT-x with Extended Page Tables (EPT) | Yes
Intel® 64 | Yes
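The 16 FLOPs-per-cycle figure gives a theoretical peak as cores × clock × FLOPs per cycle. A small Python helper, assuming double-precision FMA throughput and 18 physical cores behind a c4.8xlarge's 36 vCPUs (the core count is an assumption, not stated on the slide). AVX-heavy code often runs below the listed clocks, so treat these numbers as upper bounds.

```python
# Double-precision FLOPs per cycle per core on Haswell (two 256-bit FMA units).
FLOPS_PER_CYCLE = 16


def peak_gflops(cores: int, ghz: float, flops_per_cycle: int = FLOPS_PER_CYCLE) -> float:
    """Theoretical peak in GFLOPS = cores x clock (GHz) x FLOPs per cycle."""
    return cores * ghz * flops_per_cycle


print(peak_gflops(1, 2.9))   # one core at the 2.9 GHz base clock
print(peak_gflops(18, 3.2))  # assumed 18 cores at the 3.2 GHz all-core turbo
```

That works out to about 46 GFLOPS per core at base clock and roughly 0.9 TFLOPS for 18 cores at the all-core turbo.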
cfnCluster - provision an HPC cluster in minutes
#cfncluster https://github.com/awslabs/cfncluster
cfncluster is a sample code framework that deploys and maintains clusters on AWS. It is reasonably agnostic to what the cluster is for and can easily be extended to support different frameworks. The CLI is stateless: everything is done using CloudFormation or resources within AWS.
10 minutes
http://boofla.io/u/cfnCluster – (Boof’s HOWTO slides)
• 750+ popular scientific applications
AWS Marketplace
immediately
Introducing Alces Flight - self-scaling HPC clusters instantly ready to compute, billed by the hour and using the AWS Spot market by default to achieve supercomputing for ~1c per core per hour.
Self-service HPC … 2016
http://boofla.io/u/alcesFlight
Requirements for Launching your HPC cluster
• An Amazon Web Services (AWS) account
• An SSH key-pair in your AWS region
• An SSH client
• Optionally, a VNC client
• A workload to process
Wil Mayers, Alces
Searching AWS Marketplace
Selecting Alces Flight from Marketplace
Launching a new cluster
CloudFormation cluster launch
Access IP address
Logging in to your Flight Cluster
Cluster Architecture
• Virtual Private Cloud (VPC)
• One login node
• EBS volume for data/apps
• Compute node scaling group
• 2 to 1,152 cores
• Deployed in placement group
• Static or auto-scaling
• On-demand or Spot instances
Linux cluster facilities
• CentOS Linux cluster
• Full root access to all nodes
• Genders utility
• PDSH utility
• YUM install any software
Graphical Desktop sessions
• Create a session
• Share connection details
• Join the session via VNC
• Other collaborators can join
Using Graphical Applications
Installing Scientific Applications
• Simple command-line tool to install applications
Installing by Scientific Discipline
• Choose a depot of applications to install
Alces Gridware Application library
• Over 850 application, library and MPI versions
• Pre-optimized and stored in S3
• Option to compile and optimize on-demand
• Includes modules environment management
• Gridware project keeps pace with latest versions
• Support for commercial and licensed applications
• http://tiny.cc/gridware
Using Storage Services
• Cluster includes large storage volume for data and apps
• Tools to manage data held in object storage
• Store your data in AWS S3 quickly and easily
S3
Cluster job scheduler
• Choice of HPC cluster job schedulers
• Automate job processing on your HPC cluster
• Queue jobs for processing when nodes are available
• Auto-scaling compute nodes within user-defined limits
• Automatically rerun any jobs stopped when the Spot price exceeds your bid
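The scheduling behaviour listed above (queue jobs, scale compute nodes within user limits, rerun Spot-interrupted jobs) can be sketched as a simple control rule. This is a hypothetical Python illustration, not Alces Flight's implementation: `Job`, `Queue`, `desired_nodes` and the 36-cores-per-node figure are assumptions, and it ignores how jobs actually pack onto nodes.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cores: int


@dataclass
class Queue:
    pending: list[Job] = field(default_factory=list)

    def requeue(self, interrupted: list[Job]) -> None:
        """Put jobs stopped by a Spot interruption back at the front of the queue."""
        self.pending[:0] = interrupted


def desired_nodes(queue: Queue, cores_per_node: int, min_nodes: int, max_nodes: int) -> int:
    """Nodes needed to run every pending job at once, clamped to the user's limits."""
    needed_cores = sum(job.cores for job in queue.pending)
    needed = math.ceil(needed_cores / cores_per_node)
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical usage: 100 single-core tasks on 36-vCPU nodes, limited to 0..32 nodes.
queue = Queue([Job(f"task{i}", 1) for i in range(100)])
print(desired_nodes(queue, cores_per_node=36, min_nodes=0, max_nodes=32))
```

Clamping to `min_nodes`/`max_nodes` is what keeps an auto-scaling cluster inside a budget; requeueing at the front means interrupted work resumes before newer jobs.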
Workload to process #1
Landsat cloud coverage survey
Landsat Satellite mapping data
• Continuous record of Earth’s surface
• Data from the 1970s to present day
• Public data set available to everyone
• Stored on object storage, including AWS S3
Workload
• Survey of cloud cover around the northern tropic (Tropic of Cancer)
• Task-array job running through 360 degrees of longitude around the Earth
• Measures average cloud cover in each image
• Generates a deck of sample images
• Uploads the deck to S3 object storage
• Uses 360 compute cores
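A task-array job runs the same script many times, each copy with its own index. Assuming a Grid Engine-style scheduler that exposes that index in the `SGE_TASK_ID` environment variable, each of the 360 tasks could map its index to one degree of longitude as below. The helper names are hypothetical and the cloud-cover values are placeholders; a real job would read them from the Landsat scene metadata.

```python
import os
from statistics import mean


def longitude_for_task(task_id: int) -> int:
    """Map array-task IDs 1..360 to longitudes -180..179 (one degree per task)."""
    if not 1 <= task_id <= 360:
        raise ValueError("task_id must be in 1..360")
    return task_id - 181


def average_cloud_cover(cover_percentages: list[float]) -> float:
    """Average cloud cover (%) over the images found for one longitude."""
    return mean(cover_percentages) if cover_percentages else float("nan")


if __name__ == "__main__":
    # Grid Engine sets SGE_TASK_ID for each array task; default to 1 for a local run.
    task_id = int(os.environ.get("SGE_TASK_ID", "1"))
    lon = longitude_for_task(task_id)
    # Placeholder per-image cloud-cover values for this longitude.
    print(lon, average_cloud_cover([12.0, 40.0, 8.0]))
```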
Workflow
1. Launch your cluster
2. Enable object storage
3. Install application
4. Fetch job-script
5. Submit job
Approximate costs
• 360 jobs each taking ~5 mins
• Total CPU time = 30 core hours
• Cost of 36 core hours in AWS spot market* = $0.44
• Cost of one T2 login node for 1 hour* = $0.12
• Cost of 100GB EBS volume for apps* = <$0.01
• Alces Flight software cost = $0.00
• Total cost per daily run = $0.60 / 45p
• Cost for one year of research = $219 / £168
* based on c4.8xlarge spot rate in EU-West region; t2.large on-demand instance; EBS st1 volume; excludes S3 storage costs and sales tax where applicable
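The cost figures are easier to check with the arithmetic written out. A Python sketch, assuming 2016-era per-hour EC2 billing and that the $0.44 spot line is one c4.8xlarge instance-hour (36 vCPUs), which covers the 30 core hours if the jobs are packed onto a single instance; that packing is what the $0.44 figure implies, and `instance_hours` is an illustrative helper.

```python
import math

VCPUS_PER_C4_8XLARGE = 36


def instance_hours(core_hours: float, vcpus_per_instance: int = VCPUS_PER_C4_8XLARGE) -> int:
    """Whole instance-hours needed when billing rounds up to full hours."""
    return math.ceil(core_hours / vcpus_per_instance)


jobs, minutes_per_job = 360, 5
core_hours = jobs * minutes_per_job / 60          # 30 core hours of work
spot_rate = 0.44                                  # $ per c4.8xlarge hour (slide figure)
compute = instance_hours(core_hours) * spot_rate  # one instance-hour = 36 core hours
login, ebs, software = 0.12, 0.01, 0.00           # slide figures in $; EBS is an upper bound
daily = compute + login + ebs + software
print(f"{core_hours:.0f} core hours, ${daily:.2f} per run, ${daily * 365:.0f} per year")
```

The line items sum to about $0.57 per run (about $208 per year); rounding the daily figure up to $0.60 gives the $219 per year quoted above.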
Workload to process #2
Computational Fluid Dynamics with OpenFOAM
OpenFOAM CFD
• Computational Fluid Dynamics workload
• Simulates liquid and air flow for engineering projects
• Open-source software available to all
• Commercial support available from CFD Direct Ltd.
• Runs as a parallel job across multiple compute nodes
Workload
• Generate a mesh representing the problem
• Decomposition of the problem into sections
• Processing of the sections
• Visualization of the solution
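Decomposition splits the mesh so each core works on its own part. In OpenFOAM this is done by `decomposePar`, which partitions a 3-D mesh and tries to keep communication between parts low; the Python toy below only illustrates the basic idea of dividing N cells into near-equal contiguous blocks across 128 ranks. The mesh size and the `block_decompose` helper are hypothetical.

```python
def block_decompose(n_cells: int, n_ranks: int) -> list[range]:
    """Split cell indices 0..n_cells-1 into n_ranks contiguous, near-equal blocks."""
    base, extra = divmod(n_cells, n_ranks)
    blocks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)  # the first `extra` ranks take one more cell
        blocks.append(range(start, start + size))
        start += size
    return blocks


parts = block_decompose(1_000_000, 128)  # hypothetical mesh size, 128 cores as in this workload
print(len(parts), min(map(len, parts)), max(map(len, parts)))
```

Balanced block sizes keep every core busy for roughly the same time; a real partitioner also has to minimise the faces shared between blocks.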
Workflow
1. Launch your cluster
2. Enable object storage
3. Install application
4. Fetch job-script
5. Submit job
6. Start desktop
7. Visualize
Visualization with ParaView
Approximate costs (full solve)
• 1 job using 128 cores taking 4 hours
• Total CPU time = 512 core hours
• Cost of 512 core hours in AWS spot market* = $7.04
• Cost of one T2 login node for 4 hours* = $0.45
• Cost of 100GB EBS volume for apps* = $0.02
• Alces Flight software cost = $0.00
• Total cost per simulation = $7.51 / £5.75
* based on c4.8xlarge spot rate in EU-West region; t2.large on-demand instance; EBS st1 volume; excludes sales tax where applicable
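The $7.04 compute line is consistent with whole c4.8xlarge instances billed by the hour. A Python sketch, assuming 36 vCPUs per instance and the $0.44 per instance-hour spot rate implied by the first workload's costs; note that 128 cores for 4 hours is 512 core hours.

```python
import math

VCPUS_PER_INSTANCE = 36  # c4.8xlarge
SPOT_RATE = 0.44         # $ per c4.8xlarge hour (rate implied by the first workload's costs)

cores, hours = 128, 4
instances = math.ceil(cores / VCPUS_PER_INSTANCE)  # 4 instances supply 144 vCPUs
core_hours = cores * hours                         # 512 core hours of work
compute = instances * hours * SPOT_RATE            # 16 instance-hours
login, ebs, software = 0.45, 0.02, 0.00            # slide figures in $
total = compute + login + ebs + software
print(instances, core_hours, round(compute, 2), round(total, 2))
```

Four instances for four hours is 16 instance-hours, or $7.04; adding the login node and EBS volume gives the $7.51 total.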
Filesystems in the marketplace, too
BeeGFS, developed by the Fraunhofer Institute, is a scalable parallel cluster filesystem built with a strong focus on performance and designed for easy installation and management.
Intel Lustre® Cloud Edition is a scalable, parallel file system purpose-built for HPC and with a long history in the field supporting a range of workloads.
There’s more to come - the AWS Marketplace is growing all the time and new offerings are added frequently. Watch this space.
There are cluster filesystem options, too, for when you need extreme I/O scaling.
How to start?
1. AWS Account
3. A problem to solve
Please remember to rate this session under My Agenda on awssummit.london