Delivering a machine learning course on HPC resources

  • Delivering a machine learning course on HPC resources

    Stefano Bagnasco, Federica Legger, Sara Vallero

    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement LHCBIGDATA No 799062

  • The course

    ● Title: Big Data Science and Machine Learning
    ● Graduate Program in Physics at the University of Torino
    ● Academic year 2018-2019:
      ○ Starts in 2 weeks
      ○ 2 CFU, 10 hours (theory + hands-on)
      ○ 7 registered students
    ● Academic year 2019-2020:
      ○ March 2020
      ○ 4 CFU, 16 hours (theory + hands-on)
      ○ 2 students already registered


  • The Program

    ● Introduction to big data science
      ○ The big data pipeline: state-of-the-art tools and technologies
    ● ML and DL methods:
      ○ supervised and unsupervised models
      ○ neural networks
    ● Introduction to computer architecture and parallel computing patterns
      ○ Introduction to OpenMP and MPI (2019-2020)
    ● Parallelisation of ML algorithms on distributed resources
      ○ ML applications on distributed architectures
      ○ Beyond CPUs: GPUs, FPGAs (2019-2020)


  • The aim

    ● Applied ML course:
      ○ Many courses on advanced statistical methods are available elsewhere
      ○ Focus on hands-on sessions
    ● Students will
      ○ Familiarise with:
        ■ ML methods and libraries
        ■ Analysis tools
        ■ Collaborative models
        ■ Container and cloud technologies
      ○ Learn how to:
        ■ Optimise ML models
        ■ Tune distributed training
        ■ Work with available resources


  • Hands-on

    ● Python with Jupyter notebooks
    ● Prerequisites: some familiarity with numpy and pandas
    ● ML libraries (a minimal sketch follows after this list):
      ○ Day 2: MLlib
        ■ Gradient-Boosted Trees (GBT)
        ■ Multilayer Perceptron Classifier (MPC)
      ○ Day 3: Keras
        ■ Sequential model
      ○ Day 4: BigDL
        ■ Sequential model
    ● Coming:
      ○ CUDA
      ○ MPI
      ○ OpenMP
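    To give a flavour of the Day 2 exercise, here is a minimal MLlib sketch; the data path, column names and layer sizes are illustrative assumptions, not the actual course notebook.

```python
# Minimal Day-2-style MLlib sketch (paths, column names and layer sizes are assumptions).
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier, MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mlcourse-day2").getOrCreate()

# Assume a DataFrame with a 'features' vector column and a binary 'label' column.
data = spark.read.parquet("hdfs:///datasets/higgs_features.parquet")  # hypothetical path
train, test = data.randomSplit([0.8, 0.2], seed=42)

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
mpc = MultilayerPerceptronClassifier(labelCol="label", featuresCol="features",
                                     layers=[28, 30, 30, 2])  # 28 inputs, hidden layers of 30, 2 classes

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
for name, estimator in [("GBT", gbt), ("MPC", mpc)]:
    model = estimator.fit(train)
    print(name, "AUC =", evaluator.evaluate(model.transform(test)))
```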


  • ML Input Dataset for hands-on

    ● Open HEP dataset @ UCI, 7 GB (.csv); a loading sketch follows below
    ● Signal (heavy Higgs) + background (ttbar)
    ● 10M MC events (balanced, 50%:50%)
      ○ 21 low-level features
        ■ pt's, angles, MET, b-tag, …
      ○ 7 high-level features
        ■ Invariant masses (m(jj), m(jjj), …)
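    Loading this dataset into Spark might look like the following sketch; the HDFS path and column names are assumptions (the CSV has a label column followed by the 28 features).

```python
# Sketch: loading the UCI HIGGS csv into Spark (label + 21 low-level + 7 high-level features).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("higgs-dataset").getOrCreate()

columns = ["label"] + [f"f{i}" for i in range(28)]  # hypothetical column names

df = (spark.read
      .csv("hdfs:///datasets/HIGGS.csv", inferSchema=True)  # hypothetical path
      .toDF(*columns))

# Pack the 28 feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=columns[1:], outputCol="features")
data = assembler.transform(df).select("label", "features")
data.groupBy("label").count().show()  # roughly balanced signal/background
```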


    [Figure: signal (heavy Higgs) vs. background (ttbar) diagrams]

    Baldi, Sadowski, and Whiteson, "Searching for Exotic Particles in High-Energy Physics with Deep Learning," Nature Communications 5 (2014)

    https://archive.ics.uci.edu/ml/datasets/HIGGS

  • Infrastructure: requirements


    ● Commodity hardware (CPUs)
    ● Non-dedicated and heterogeneous resources:
      ○ Bare metal
        ■ 1 x 24 cores, 190 GB RAM
        ■ 4 x 28 cores, 260 GB RAM
      ○ IaaS Cloud (on premises)
        ■ 10 VMs, 8 cores, 70 GB RAM each
    ● Uniform application/service orchestration layer -> Kubernetes
    ● High-throughput vs. high-performance -> Spark (a configuration sketch follows below)
    ● Distributed datasets -> HDFS
    ● Elasticity: allow scaling up if there are unused resources
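    A hedged sketch of how these pieces are typically wired together with Spark on Kubernetes; all hostnames, image names and resource sizes are placeholders, not the actual deployment.

```python
# Sketch: a Spark session targeting a Kubernetes cluster, reading data from HDFS.
# All hostnames, image names and sizes below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mlcourse")
         .master("k8s://https://kube-api.example.org:6443")          # Kubernetes control plane
         .config("spark.kubernetes.container.image", "example/spark-py:latest")
         .config("spark.kubernetes.namespace", "mlcourse")
         .config("spark.executor.instances", "10")                   # requested executors
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

df = spark.read.csv("hdfs://namenode:8020/datasets/HIGGS.csv", inferSchema=True)
```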

  • What about HPCs?


    ● HPC = high-performance processors + low-latency interconnect

    ● HPC clusters are typically managed with a batch system

    ● The OCCAM HPC facility at the University of Torino employs a cloud-like management strategy coupled to lightweight virtualization
      ○ https://c3s.unito.it/index.php/super-computer

  • The OCCAM supercomputer


    Cloud-like HPC cluster

    APPLICATION: defined by
    ● Runtime environment
    ● Resource requirements
    ● Execution model

    VIRTUALIZATION: the pivotal technologies for the middleware architecture are Linux containers, currently managed with Docker.
    ● Package, ship and run distributed application components with guaranteed platform parity across different environments
    ● Democratizing virtualization by providing it to developers in a usable, application-focused form

    COMPUTING MODEL
    ● HPC: batch-like, multi-node workloads using MPI and inter-node communication
    ● PIPELINES: multi-step data analysis requiring high-memory, large single-image nodes
    ● VIRTUAL WORKSTATION: code execution (e.g. R or ROOT) in a single multicore node, possibly with GPU acceleration
    ● JUPYTER-HUB: with a Spark backend for ML and Big Data workloads

    ON DEMAND: autoscaling

  • Infrastructure: architecture

    [Architecture diagram: OAuth login; a Kubernetes control plane and Kubernetes workers spanning high-class hardware, lower-class hardware and virtual machines; several Spark drivers, each with its own set of Spark executors; HDFS for the datasets]

    Total resources: 216 CPUs, 1.9 TB memory, 2.3 TB HDFS, 1 Gbps network

  • Infrastructure: elasticity

    [Diagram: the Farm Operator scales each Spark driver's pool of executors up and down]

    ● The Spark driver continuously scales up to reach the requested number of executors
    ● No static quotas are enforced, but a minimum number of executors is guaranteed to each tenant
    ● Custom Kubernetes Operator (alpha version), sketched below:
      ○ lets tenants occupy all available resources in a FIFO manner
      ○ undeploys excess executors only to guarantee the minimum number of resources to all registered tenants
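    A simplified sketch of the bookkeeping behind such a scale-down decision; this is not the actual farmcontroller code, and the tenant label and quotas are made up.

```python
# Simplified sketch of the scale-down bookkeeping (hypothetical tenant label and quotas).
from kubernetes import client, config

MIN_EXECUTORS = {"tenant-a": 4, "tenant-b": 4}   # hypothetical per-tenant minimum

config.load_kube_config()
core = client.CoreV1Api()

def executors_per_tenant(namespace="spark"):
    """Count running executor pods per tenant, using the executor label Spark sets."""
    counts = {}
    pods = core.list_namespaced_pod(namespace, label_selector="spark-role=executor")
    for pod in pods.items:
        tenant = (pod.metadata.labels or {}).get("tenant", "unknown")  # hypothetical label
        counts[tenant] = counts.get(tenant, 0) + 1
    return counts

counts = executors_per_tenant()
starved = [t for t, minimum in MIN_EXECUTORS.items() if counts.get(t, 0) < minimum]
print("Tenants below their guaranteed minimum:", starved)
```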

  • Scaling tests

    ● #cores per executor
    ● #cores per machine
    ● #cores in a homogeneous cluster
    ● Strong scaling efficiency = time(1) / (N * time(N)), where N = #cores (see the sketch below)
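    In other words, with time(1) the runtime on one core and time(N) the runtime on N cores:

```python
# Strong scaling efficiency as defined above: 1.0 means perfect scaling.
def strong_scaling_efficiency(time_1, time_n, n_cores):
    return time_1 / (n_cores * time_n)

# Made-up example: 100 s on 1 core, 30 s on 4 cores -> efficiency ~0.83
print(strong_scaling_efficiency(100.0, 30.0, 4))
```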

    [Plots: strong scaling efficiency for MLlib GBT, MLlib MPC and BigDL NN, compared to perfect scaling]

  • ML models & lessons learned

    Model                                         | AUC | Time     | # events | Cores | Note
    MLlib GBT                                     | 82  | 15 m     | 10M      | 25    | Doesn't scale
    MLlib MPC (4 layers, 30 hidden units)         | 74  | 9 m      | 10M      | 25    | Scales well, but can't build complex models
    Keras Sequential (1 layer, 100 hidden units)  | 81  | 18 m     | 1M       | 25    | No distributed training, cannot process 10M events
    BigDL Sequential (2 layers, 300 hidden units) | 86  | 3 h 15 m | 10M      | 88    | 1 core/executor required
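    For context, the Keras row corresponds roughly to a model of this shape; this is a sketch under assumptions (optimizer, activations), not the actual notebook.

```python
# Sketch of the Keras row in the table: a Sequential model with one hidden layer
# of 100 units on the 28 HIGGS features (hyperparameters are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(100, activation="relu", input_shape=(28,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.summary()
```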

  • Summary

    ● Applied ML course for Ph.D. students focusing on distributed training of ML models

    ● Infrastructure runs on ‘opportunistic’ resources

    ● Architecture can be ‘reused’ on OCCAM


  • Spares


  • Farm Kube Operator


    https://github.com/svallero/farmcontroller

    ● Spark Driver deploys executor Pods with given namespace/label/name (let’s call this triad a selector)

    ● But a Pod is not a scalable Kubernetes resource (whereas a Deployment is)

    ● The Farm Operator implements two Custom Resource Definitions (CRDs), each with its own Controller:
      ○ Farm Resource
      ○ FarmManager Resource

    ● The Farm Operator can be applied to any other app (farm type) with similar features

    ● CAVEAT:
      ○ The Farm app should be resilient to the live removal of executors (e.g. Spark, HTCondor)


  • Farm Kube Operator (continued)


    Farm Resource
    ● Collects Pods with a given selector
    ● Implements scale-down
    ● Defines a minimum number of executors (quota)
    ● Reconciles on selected Pod events

    FarmManager Resource
    ● Reconciles on Farm events
    ● Scales down Farms over quota only if some other Farm requests resources and is below its quota
    ● Simple algorithm: the number of killed Pods per Farm is proportional to the number of Pods over the quota (should be improved); see the sketch below
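    The proportional rule can be sketched in a few lines (the numbers are illustrative):

```python
# Sketch of the proportional scale-down rule: each Farm over quota gives up
# Pods in proportion to how far over quota it is (numbers are illustrative).
def pods_to_kill(over_quota, n_needed):
    total_over = sum(over_quota.values())
    return {farm: round(n_needed * excess / total_over)
            for farm, excess in over_quota.items()}

# Two Farms are 6 and 2 Pods over quota; another tenant needs 4 executors.
print(pods_to_kill({"farm-a": 6, "farm-b": 2}, n_needed=4))
# -> {'farm-a': 3, 'farm-b': 1}
```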


  • OCCAM

    HPC facility at the University of Turin

    ● Managed using container-based, cloud-like technologies
    ● Computing applications are run on Virtual Clusters deployed on top of the physical infrastructure

  • OCCAM SPECS

    2 Management nodes
    ● CPU: 2x Intel Xeon E5-2640 v3, 8 cores, 2.6 GHz
    ● RAM: 64 GB / 2133 MHz
    ● Disk: 2x 1 TB HDD, RAID0
    ● Net: IB 56 Gb/s + 2x 10 Gb/s + 4x 1 Gb/s
    ● Form factor: 1U

    32 Light nodes
    ● CPU: 2x Intel Xeon E5-2680 v3, 12 cores, 2.5 GHz
    ● RAM: 128 GB / 2133 MHz (8x 16 GB)
    ● Disk: 400 GB SATA SSD, 1.8 inch
    ● Net: IB 56 Gb/s + 2x 10 Gb/s
    ● Form factor: high density (4 nodes per RU)

    4 Fat nodes
    ● CPU: 4x Intel Xeon E7-4830 v3, 12 cores, 2.1 GHz
    ● RAM: 768 GB / 1666 MHz (48x 16 GB) DDR4
    ● Disk: 1x 800 GB SSD + 1x 2 TB HDD, 7200 rpm
    ● Net: IB 56 Gb/s + 2x 10 Gb/s

    4 GPU nodes
    ● CPU: 2x Intel Xeon E5-2680 v3, 12 cores, 2.1 GHz
    ● RAM: 128 GB / 2133 MHz (8x 16 GB) DDR4
    ● Disk: 1x 800 GB SAS SSD, 6 Gbps, 2.5 inch
    ● Net: IB 56 Gb/s + 2x 10 Gb/s
    ● GPU: 2x NVIDIA K40 on PCIe Gen3 x16

  • Scaling tests #1

    ● Optimize #cores per executor
    ● Model: MLlib MPC and GBT, 1M events
    ● One machine: t2-mlwn-01.to.infn.it
    ● In the ‘literature’, #cores = 5 is the magic number to achieve maximum HDFS throughput

    Findings:
    ● #cores = 5 is optimal
    ● GBT does not scale well; expected, since GBT training is hard to parallelise

  • Scaling tests #2

    ● Optimize #executors
    ● Model: MLlib MPC, 1M and 10M events
    ● #cores/executor = 5
    ● One machine

  • Scaling tests #3

    ● Scaling on homogeneous resources
      ○ bare metal, 4 machines with 56 cores and 260 GB RAM

  • ML models & lessons learned 2


    [Plots: GBT fast vs. MPC; GBT slow vs. Keras Sequential]