Transcript: Machine Learning on OpenStack (John Garbutt, StackHPC, Feb 2021)

Page 1:

Machine Learning on OpenStack

How can Scientific OpenStack help?

John Garbutt, StackHPC
Feb 2021

Page 2:

StackHPC Company Overview

● Formed 2016, based in Bristol, UK
  ○ Presence in Cambridge, France and Poland
  ○ Currently 16 people
● Founded on HPC expertise
  ○ Software Defined Networking
  ○ Systems Integration
  ○ OpenStack Development and Operations
● Motivated to transfer this expertise into Cloud to address HPC & HPDA
● “Open” Modus Operandi
  ○ Upstream development of OpenStack capabilities
  ○ Consultancy/Support to end-user organizations in managing HPC service transition
  ○ Scientific-SIG engagement for the Open Infrastructure Foundation
  ○ Open Source, Design, Development and Community

Page 3:

What is needed for Machine Learning?

Page 4:

Training, Inference, Data and more

https://developers.google.com/machine-learning/crash-course/production-ml-systems

Page 5:

TensorFlow Extended (TFX)

https://github.com/GoogleCloudPlatform/tf-estimator-tutorials/blob/54c3099d3a687052bd463e1344a8836913ac2d26/00_Miscellaneous/tfx/02_tfx_end_to_end.ipynb

Page 6:

TensorFlow Extended (TFX)

https://github.com/GoogleCloudPlatform/tf-estimator-tutorials/blob/54c3099d3a687052bd463e1344a8836913ac2d26/00_Miscellaneous/tfx/02_tfx_end_to_end.ipynb

Pipeline stages (diagram): Transform Data → Model Training → Inference

Page 7:

Machine Learning Breakdown

● Data Processing and Pipelines (see the transform sketch below)
  ○ Transform to extract Features and Labels
  ○ Data Visualization
● Training: Static vs Dynamic Model Training
  ○ Does the input data change over time?
  ○ Pipeline reproducibility
● Inference: Offline vs Online predictions
  ○ Regression and Classification raise similar questions
  ○ Decision latency can be critical; you may need more resources to get answers faster
● Model complexity
  ○ Linear, non-linear, how deep, how wide, ...
● Flow: Dev -> Stage -> Prod
● MLOps: Configuration Management, Deployment tooling, Monitoring, ...

https://www.kubeflow.org/docs/started/kubeflow-overview/#introducing-the-ml-workflow
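
As a minimal, hedged illustration of the transform step above (extracting features and labels from raw records), here is a plain-Python sketch; the record field names and scaling choices are hypothetical:

```python
# Minimal sketch of a transform step: raw records in, (features, label) out.
# The field names and scaling choices here are hypothetical examples.
from typing import Dict, List, Tuple

def transform(record: Dict) -> Tuple[List[float], int]:
    """Extract a numeric feature vector and a label from one raw record."""
    features = [
        float(record["duration_s"]),
        float(record["bytes_in"]) / 1024.0,  # crude rescaling to KiB
        1.0 if record["weekend"] else 0.0,   # boolean -> numeric feature
    ]
    label = int(record["label"])             # the value we want to predict
    return features, label

raw = [{"duration_s": 42, "bytes_in": 2048, "weekend": True, "label": 0}]
print([transform(r) for r in raw])  # [([42.0, 2.0, 1.0], 0)]
```

Running the identical transform at training time and at serving time is what makes the pipeline reproducibility point above matter.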

Page 8:

Infrastructure Requests

● Offline work fits a batch system (e.g. Slurm), but online work does not
  ○ Offline Training plus Online Inference: you may want a mix (request sketch below)
● Scale up
  ○ CPUs are not always the best price per performance
  ○ GPUs are often better; also IPUs, Tensor Cores, new CPU instructions
  ○ Connect to disaggregated accelerators and/or storage
● Scale out
  ○ Distribute work via an RDMA low-latency interconnect
● High Performance Storage
  ○ Keep the model processing fed; share transformed data
  ○ RDMA-enabled low-latency access to data sets
● Monitoring, to understand how your chosen model performs
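
To make the offline/online split concrete, here is a hedged sketch of a latency-sensitive online prediction call. It assumes a TensorFlow Serving REST endpoint; the host name and model name are hypothetical:

```python
# Hedged sketch: an online (synchronous) prediction request, where decision
# latency matters. Assumes a TensorFlow Serving REST endpoint is running;
# the host "inference.example.org" and model name "resnet" are hypothetical.
import time
import requests

payload = {"instances": [[0.0] * 224 * 224 * 3]}  # one dummy image-shaped input

start = time.perf_counter()
resp = requests.post(
    "http://inference.example.org:8501/v1/models/resnet:predict",
    json=payload,
    timeout=1.0,  # online callers typically enforce a tight latency budget
)
latency_ms = (time.perf_counter() - start) * 1000
print(resp.status_code, f"{latency_ms:.1f} ms")
```

An offline job would instead batch many inputs and optimize for throughput, which is why it maps well onto Slurm.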

Page 9:

Scientific OpenStack

Page 10:

HPC Stack 1.0

Page 11:

Motivations Driving a Change

● Manage the increasing complexity

● Better knowledge sharing

● Move away from Resource Silos

Page 12:

HPC Stack 1.0

Page 13:

HPC Stack 2.0

Page 14:

HPC Stack 2.0

Page 15:

HPC Stack 2.0

Page 16:

OpenStack Magnum

● Kubernetes clusters on demand (see the sketch below)
  ○ … working to add support for K8s v1.20
  ○ Terraform and Ansible can be used to manage clusters
● Magnum Cluster Autoscaler
  ○ Automatically expands and shrinks the K8s cluster, within defined limits
  ○ Based on when pods can / can’t be scheduled
● Storage Integration
  ○ Cinder CSI for Volumes (ReadWriteOnce PVC)
  ○ Manila CSI for CephFS shares (ReadWriteMany PVC)
● Network Integration
  ○ Octavia Load Balancer as a Service

https://github.com/RSE-Cambridge/iris-magnum
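
A hedged sketch of requesting an autoscaling cluster, assuming a recent openstacksdk with Magnum (container_infrastructure_management) support; the cloud, template and keypair names are hypothetical:

```python
# Hedged sketch: ask Magnum for a Kubernetes cluster via openstacksdk.
# Assumes a "mycloud" entry in clouds.yaml and an existing cluster template;
# the names "k8s-v1.20-template" and "mykey" are hypothetical.
import openstack

conn = openstack.connect(cloud="mycloud")
coe = conn.container_infrastructure_management

template = coe.find_cluster_template("k8s-v1.20-template")
cluster = coe.create_cluster(
    name="ml-cluster",
    cluster_template_id=template.id,
    keypair="mykey",
    master_count=1,
    node_count=2,
    # Labels read by the Magnum cluster autoscaler to set its limits:
    labels={
        "auto_scaling_enabled": "true",
        "min_node_count": "1",
        "max_node_count": "8",
    },
)
print(cluster.id)
```

With those labels, the autoscaler adds nodes when pods cannot be scheduled and removes them again, within the stated min/max bounds.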

Page 17:

OpenStack GPUs

Page 18:

GPUs in OpenStack

● Ironic
  ○ Full access to hardware, including all GPUs and Networking
● Virtual Machine with PCI Passthrough
  ○ Share out a single physical machine
  ○ Flavors with one or multiple GPUs (flavor sketch below)
  ○ Some protection of GPU firmware
  ○ … restrictions around using data centre GPUs
● Virtual Machine with vGPU
  ○ Typical vGPU requires expensive licences
  ○ Time-slicing vGPU was created for VDI
  ○ Depends whether your workloads can saturate a full GPU
  ○ … but A100 includes MIG (Multi-Instance GPU)
● Some GPU features need RDMA networking
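
As a hedged sketch of how the “flavors with one or multiple GPUs” point is typically realized: the operator defines a Nova PCI alias (in nova.conf) and ties it to a flavor via extra specs. The names below (“p100”, “g1.p100x1”) are hypothetical, and this assumes a recent openstacksdk:

```python
# Hedged sketch: define a flavor that requests one passthrough GPU.
# Assumes the operator has configured a Nova PCI alias named "p100"
# (e.g. [pci] alias = ... in nova.conf); all names are hypothetical.
import openstack

conn = openstack.connect(cloud="mycloud")

flavor = conn.compute.create_flavor(
    name="g1.p100x1", ram=65536, vcpus=8, disk=100,
)
# Each instance of this flavor gets one P100 via PCI passthrough.
conn.compute.create_flavor_extra_specs(
    flavor, {"pci_passthrough:alias": "p100:1"}
)
```

A two-GPU flavor would use "p100:2"; restricting projects to such flavors is the quota workaround described on the next slide.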

Page 19:

GPU Resource Management

● GPUs are expensive; they need careful management to get a good ROI
● Batch Queue System, e.g. Slurm
  ○ Sharing via batch jobs can be very efficient
  ○ … but not great for dynamic training and online inference
● OpenStack Quotas
  ○ Today there is no real support for a GPU quota
  ○ … but flavors that request GPUs can be limited to specific projects
  ○ and projects can be limited to specific subsets of hardware
● Reservations and Pre-emptibles
  ○ Scientific OpenStack is looking at OpenStack Blazar
  ○ Projects can reserve resources ahead of time
  ○ Option to use pre-emptible instances to scale out when resources are free
  ○ … to stop people holding on to GPUs without using them
● Get in touch if you are interested in shaping this work

https://unsplash.com/photos/oBbTc1VoT-0

Page 20:

OpenStack and RDMA (Remote Direct Memory Access)

Page 21:

RDMA with OpenStack

● Ethernet with RoCEv2; also InfiniBand
  ○ Ethernet switches can use ECN and PFC to reduce packet loss, plus a larger MTU
● SR-IOV (port sketch below)
  ○ A hardware physical function (PF) is mapped to multiple virtual functions (VFs)
  ○ Hardware configured to limit traffic to VLANs (and sometimes overlays)
  ○ … typically no security groups; bonding possible with some NICs; QoS possible
  ○ https://www.stackhpc.com/vxlan-ovs-bandwidth.html and https://www.stackhpc.com/sriov-kayobe.html
● The Virtual Machine runs drivers for the specific NIC
  ○ … ignoring mdev for now
● Live-migration with SR-IOV
  ○ Makes use of hot unplug then hot plug
  ○ Possible to bond with a virtual NIC, but that breaks RDMA
  ○ … in future mdev may help, but not today
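
A hedged sketch of attaching a VM to an SR-IOV virtual function with openstacksdk: creating the Neutron port with binding:vnic_type=direct is what requests a VF rather than a virtio interface. The network, image and flavor names are hypothetical:

```python
# Hedged sketch: boot a VM with an SR-IOV VF attached via Neutron.
# The names "rdma-net", "CentOS8" and "m1.large" are hypothetical.
import openstack

conn = openstack.connect(cloud="mycloud")

network = conn.network.find_network("rdma-net")
port = conn.network.create_port(
    network_id=network.id,
    binding_vnic_type="direct",  # request an SR-IOV VF, not a virtio port
)

server = conn.compute.create_server(
    name="rdma-node-0",
    image_id=conn.image.find_image("CentOS8").id,
    flavor_id=conn.compute.find_flavor("m1.large").id,
    networks=[{"port": port.id}],
)
print(conn.compute.wait_for_server(server).status)
```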

Page 22:

Kubernetes RDMA with OpenStack

● Some Pods need RDMA-enabled networking
● Kubernetes in a VM with VF passthrough (pod sketch below)
  ○ OpenStack controls network isolation
  ○ Pods are forced to use host networking to get RDMA
  ○ … not so bad if there is one Pod per VM, with the Magnum cluster autoscaler
● Kubernetes in a VM with PF passthrough
  ○ Kubelet manages virtual function passthrough to pods
  ○ OpenStack maps devices to physical networks
  ○ The switch port could be configured out of band to restrict the allowed VLANs
  ○ A NIC that is passed through typically can’t be used by the host or any other VM
● Kubernetes deployed on an Ironic server
  ○ Similar to PF passthrough
  ○ … but Neutron could orchestrate the switch port configuration
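
A hedged sketch of the VF-passthrough case above, where the Pod uses host networking to reach the VM's RDMA-capable interface; it uses the official Kubernetes Python client, and the container image is hypothetical:

```python
# Hedged sketch: a Pod on host networking, so it can use the node's
# RDMA-capable VF directly (one such Pod per VM works well with the
# Magnum cluster autoscaler). The container image is hypothetical.
from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rdma-worker"),
    spec=client.V1PodSpec(
        host_network=True,  # share the node's network namespace (and VF)
        containers=[
            client.V1Container(
                name="worker",
                image="registry.example.org/rdma-worker:latest",
                command=["sleep", "infinity"],
            )
        ],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```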

Page 23:

RDMA Remote Storage

● OpenStack supports fast local storage
  ○ The ratio to CPU and RAM is typically fixed, but workload needs vary
  ○ Remote storage, like Ceph RBD, can have a very high latency
● OpenStack supports provider VLANs
  ○ Can be shared with a select group of projects
  ○ You can have a Neutron router onto the network, if required
● Shared Storage can itself be a workload in OpenStack
  ○ Examples: Lustre, BeeGFS
  ○ Run on bare metal or on VMs with RDMA enabled
  ○ Provides a shared file system to stage data into (PVC sketch below)
● External appliances can be accessed via provider VLANs
  ○ Some storage can integrate with OpenStack Manila for Filesystem as a Service
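
A hedged sketch of consuming such shared storage from Kubernetes: a ReadWriteMany claim against a Manila CSI (CephFS) storage class, using the Kubernetes Python client. The storage class name is hypothetical:

```python
# Hedged sketch: claim a shared ReadWriteMany filesystem to stage training
# data into, e.g. CephFS via the Manila CSI driver. The storage class name
# "csi-manila-cephfs" is hypothetical.
from kubernetes import client, config

config.load_kube_config()

pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="training-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteMany"],  # shared by many pods at once
        storage_class_name="csi-manila-cephfs",
        resources=client.V1ResourceRequirements(
            requests={"storage": "500Gi"}
        ),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```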

Page 24:

Example workload: Monitoring Slurm

Pages 25-28: (image-only slides illustrating the Slurm monitoring example)

Example workload: Horovod Benchmarks

Page 29:

Horovod on P3 AlaSKA

● Distributed deep learning framework
● Supported by the LF AI & Data Foundation
● https://github.com/horovod/horovod
● P3 AlaSKA
  ○ TCP 10GbE
  ○ RoCE 25GbE
  ○ InfiniBand 100Gb
  ○ 2 GPU nodes, each with 4 x P100 GPUs
● ResNet-50 Benchmark (training pattern sketched below)
  ○ All tests use 8 P100 GPUs
  ○ On bare metal, using OpenStack Ironic
  ○ Horovod on K8s with TensorFlow and Open MPI
  ○ Note: higher is better
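
For context on what such benchmark runs look like in code, a hedged sketch of the usual Horovod + Keras data-parallel pattern (this is the general shape, not the exact benchmark script used on P3 AlaSKA; the dataset is a placeholder):

```python
# Hedged sketch of Horovod data-parallel training with Keras; this is the
# general pattern, not the exact benchmark code used on P3 AlaSKA.
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()  # one process per GPU, launched via mpirun/horovodrun

# Pin each process to a single local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.applications.ResNet50(weights=None)
# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across workers with allreduce.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
# model.fit(dataset, callbacks=callbacks, ...)  # dataset is a placeholder
```

Each process drives one GPU; the launcher starts eight of them across the two nodes, and the interconnect under test (TCP, RoCE or InfiniBand) carries the allreduce traffic being compared.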

Page 30:

Horovod

https://github.com/horovod/horovod/blob/master/docs/benchmarks.rst

Page 31:

Example hardware: NVIDIA DGX A100

Page 32:

NVIDIA DGX A100

● 200Gb/s ConnectX-6 for each GPU
● Local NVMe to cache training data
● NVIDIA NVLink
  ○ The DGX A100 has 6 NVIDIA NVSwitch fabrics
  ○ Each A100 GPU uses twelve NVLink interconnects, two to each NVSwitch
  ○ GPU-to-GPU communication at 600 GB/s
  ○ All-to-all peak of 4.8 TB/s in both directions (8 GPUs x 600 GB/s)
● A100 has Multi-Instance GPU (MIG)
  ○ Up to 7 MIG instances per A100
  ○ Each MIG GPU instance has its own memory, cache, and streaming multiprocessors
  ○ Multiple users can share the same GPU and run all instances simultaneously, maximizing GPU efficiency

https://developer.nvidia.com/blog/defining-ai-innovation-with-dgx-a100/

Page 33:

Next Steps with Scientific OpenStack

Page 34:

Scientific OpenStack Digital Assets

● Existing assets
  ○ Reference OpenStack architecture, configuration and operational tooling
  ○ Reference platforms and workloads, such as:
    ■ https://github.com/RSE-Cambridge/iris-magnum
● Edinburgh Institute of Astronomy
  ○ 2nd IRIS site starting to adopt the Scientific OpenStack Digital Assets
● SKA are funding the move of P3 AlaSKA into IRIS
● Updates in March 2021 due to include:
  ○ GPU and SR-IOV best practice guides, using P3 AlaSKA hardware
  ○ Magnum updated to support Kubernetes v1.20
  ○ Improved Resource Management via Blazar
  ○ Prometheus-based Utilization Monitoring
  ○ Assessment of porting JASMIN Cluster as a Service to IRIS

Page 35:

Summary of ML on OpenStack

● ML represents a diverse set of workloads
  ○ … with a correspondingly diverse set of infrastructure needs
  ○ Latency-sensitive Online Inference vs Offline Training, Regression vs Classification, etc.
  ○ Wide variety in model complexity and in the size of data inputs
● Broad ecosystem of tools and platforms
  ○ Many assume Kubernetes is available for deployment
  ○ OpenStack can provide the resources needed, Kubernetes or not
● Challenging to use resources efficiently
  ○ No generic “best fit” mix of Compute, Networking and Storage
  ○ GPUs, Tensor Cores, IPUs and FPGAs can be more efficient than CPUs
  ○ Storage and Networking need to keep the processing fed
  ○ Demand for Online Inference (and training) can be hard to predict

Page 36:

Questions?