Delivering a machine learning course on HPC resources · 2019. 12. 1.
-
Delivering a machine learning course on HPC resources
Stefano Bagnasco, Federica Legger, Sara Vallero
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement LHCBIGDATA No 799062
-
The course
● Title: Big Data Science and Machine Learning
● Graduate Program in Physics at University of Torino
● Academic year 2018-2019:
○ Starts in 2 weeks
○ 2 CFU, 10 hours (theory + hands-on)
○ 7 registered students
● Academic year 2019-2020:
○ March 2020
○ 4 CFU, 16 hours (theory + hands-on)
○ Already 2 registered students
2
-
The Program
● Introduction to big data science
○ The big data pipeline: state-of-the-art tools and technologies
● ML and DL methods:
○ supervised and unsupervised models
○ neural networks
● Introduction to computer architecture and parallel computing patterns
○ Initiation to OpenMP and MPI (2019-2020)
● Parallelisation of ML algorithms on distributed resources
○ ML applications on distributed architectures
○ Beyond CPUs: GPUs, FPGAs (2019-2020)
3
-
The aim
● Applied ML course:
○ Many courses on advanced statistical methods available elsewhere
○ Focus on hands-on sessions
● Students will
○ Familiarise with:
■ ML methods and libraries
■ Analysis tools
■ Collaborative models
■ Container and cloud technologies
○ Learn how to:
■ Optimise ML models
■ Tune distributed training
■ Work with available resources
4
-
Hands-on
● Python with Jupyter notebooks
● Prerequisites: some familiarity with numpy and pandas
● ML libraries:
○ Day 2: MLlib
■ Gradient-Boosted Trees (GBT)
■ Multilayer Perceptron Classifier (MPC)
○ Day 3: Keras
■ Sequential model
○ Day 4: BigDL
■ Sequential model
● Coming:
○ CUDA
○ MPI
○ OpenMP
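The Day 3 Keras session can be sketched as a minimal Sequential model. This is an illustrative sketch, not the actual course notebook: it assumes the 28 HIGGS input features and the 1-layer, 100-hidden-unit architecture quoted later in the results; layer sizes, optimiser and metric are assumptions.

```python
# Minimal sketch of a Keras Sequential binary classifier for the hands-on.
# Assumptions: 28 input features (HIGGS), one hidden layer of 100 units.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input


def build_model(n_features=28, hidden_units=100):
    model = Sequential([
        Input(shape=(n_features,)),
        Dense(hidden_units, activation="relu"),
        Dense(1, activation="sigmoid"),  # signal vs. background probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["AUC"])
    return model
```

With these sizes the model has 28·100+100 weights in the hidden layer plus 100+1 in the output, i.e. 3001 trainable parameters.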
5
-
ML Input Dataset for hands-on
● Open HEP dataset @UCI, 7 GB (.csv)
● Signal (heavy Higgs) + background
● 10M MC events (balanced, 50%:50%)
○ 21 low-level features
■ pt's, angles, MET, b-tag, …
○ 7 high-level features
■ Invariant masses (m(jj), m(jjj), …)
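Loading the dataset into a notebook can be sketched as follows, assuming the layout described above (no header row; label first, then the 21 low-level and 7 high-level feature columns). The column names are illustrative placeholders, not the names used in the course material.

```python
# Sketch of loading the UCI HIGGS csv.
# Assumptions: no header; column 0 is the label, followed by 21 low-level
# and 7 high-level features. Column names below are placeholders.
import pandas as pd

LOW_LEVEL = [f"low_{i}" for i in range(21)]
HIGH_LEVEL = [f"high_{i}" for i in range(7)]  # m(jj), m(jjj), ... in the real file


def load_higgs(path_or_buf, nrows=None):
    cols = ["label"] + LOW_LEVEL + HIGH_LEVEL
    df = pd.read_csv(path_or_buf, header=None, names=cols, nrows=nrows)
    return df[LOW_LEVEL + HIGH_LEVEL], df["label"].astype(int)
```

`nrows` makes it easy to work on a small slice of the 10M events during the hands-on.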
6
[Figure: signal (heavy Higgs) vs. background (ttbar) event diagrams]
Baldi, Sadowski, and Whiteson, "Searching for Exotic Particles in High-Energy Physics with Deep Learning," Nature Communications 5 (2014) 4308
https://archive.ics.uci.edu/ml/datasets/HIGGS
-
Infrastructure: requirements
7
● Commodity hardware (CPUs)
● Non-dedicated and heterogeneous resources:
○ Bare metal
■ 1 x 24 cores, 190 GB RAM
■ 4 x 28 cores, 260 GB RAM
○ IaaS Cloud (on premises)
■ 10 VMs, 8 cores, 70 GB RAM each
● Uniform application/service orchestration layer -> Kubernetes
● High-throughput vs. high-performance -> Spark
● Distributed datasets -> HDFS
● Elasticity: allow scaling up when there are unused resources
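The requirements above map onto a handful of Spark settings. A hedged sketch, with placeholder values (the master URL, image, namespace and HDFS endpoint are illustrative, not the actual cluster's):

```python
# Illustrative Spark-on-Kubernetes settings matching the requirements:
# Kubernetes orchestration, HDFS for datasets, elastic executor count.
# All endpoint names below are placeholders.
spark_conf = {
    "spark.master": "k8s://https://kubernetes.default.svc",  # placeholder API server
    "spark.kubernetes.container.image": "spark-ml:latest",   # placeholder image
    "spark.kubernetes.namespace": "tenant-a",                # one namespace per tenant
    "spark.executor.instances": "10",  # requested executors; elasticity may grant fewer
    "spark.executor.cores": "5",       # 5 cores/executor (see scaling tests)
    "spark.hadoop.fs.defaultFS": "hdfs://namenode:8020",     # placeholder HDFS endpoint
}
```

In a notebook these pairs would be fed to `SparkSession.builder.config(...)` one by one.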
-
What about HPCs?
8
● HPC = high-performance processors + low-latency interconnect
● HPC clusters are typically managed with a batch system
● The OCCAM HPC @University of Torino employs a cloud-like management strategy coupled with lightweight virtualization -> OCCAM facility
○ https://c3s.unito.it/index.php/super-computer
-
The OCCAM supercomputer
9
Cloud-like HPC cluster
APPLICATION
Defined by:
● Runtime environment
● Resource requirements
● Execution model

VIRTUALIZATION
The pivotal technologies for the middleware architecture are Linux containers, currently managed with Docker:
● package, ship and run distributed application components with guaranteed platform parity across different environments
● democratize virtualization by providing it to developers in a usable, application-focused form
COMPUTING MODEL
HPC: batch-like, multi-node workloads using MPI and inter-node communication
PIPELINES: multi-step data analysis requiring high-memory, large single-image nodes
VIRTUAL WORKSTATION: code execution (e.g. R or ROOT) in a single multicore node, possibly with GPU acceleration
JUPYTER-HUB: with Spark backend for ML and Big Data workloads
ON DEMAND: autoscaling
-
Infrastructure: architecture
10
[Architecture diagram: OAuth login; multiple Spark Drivers with their Executors deployed across high-class hardware, lower-class hardware and Virtual Machines; Kubernetes Control Plane and Workers; HDFS for the datasets]
● CPUs: 216
● Memory: 1.9 TB
● HDFS: 2.3 TB
● Network: 1 Gbps
-
Infrastructure: elasticity
11
[Diagram: the Farm Operator scaling the Executors of several Spark Drivers up and down]
● The Spark driver continuously scales up to reach the requested number of executors
● No static quotas are enforced, but a Min number of executors is granted to each tenant
● Custom Kubernetes Operator (alpha version):
○ lets tenants occupy all available resources in a FIFO manner
○ undeploys excess executors only when needed to grant the Min number of resources to all registered tenants
-
Scaling tests
● #cores per executor
● #cores per machine
● #cores in homogeneous cluster
● Strong scaling efficiency = time(1)/(N*time(N))
○ N = #cores
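The efficiency definition above translates directly into code (a one-line helper; function and argument names are my own):

```python
# Strong-scaling efficiency as defined on this slide: time(1) / (N * time(N)).
# Equals 1.0 for perfect scaling, < 1.0 when parallel overheads dominate.
def strong_scaling_efficiency(t1, tN, n_cores):
    return t1 / (n_cores * tN)
```

For example, a job taking 100 s on 1 core and 25 s on 4 cores scales perfectly (efficiency 1.0); 30 s on 4 cores gives ~0.83.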
12
[Plots: strong scaling efficiency for MLlib GBT, MLlib MPC and BigDL NN, compared with perfect scaling]
-
ML models & lessons learned
13
Model | AUC (%) | Time | # events | Cores | Note
MLlib GBT | 82 | 15m | 10M | 25 | Doesn't scale
MLlib MPC (4 layers, 30 hidden units) | 74 | 9m | 10M | 25 | Scales well, can't build complex models
Keras Sequential (1 layer, 100 hidden units) | 81 | 18m | 1M | 25 | No distributed training, cannot process 10M events
BigDL Sequential (2 layers, 300 hidden units) | 86 | 3h15m | 10M | 88 | 1 core/executor required
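The GBT row can be reproduced in miniature. A single-node sketch that uses scikit-learn's GradientBoostingClassifier as a stand-in for MLlib's GBT, with synthetic data in place of the HIGGS sample (so the numbers will not match the table):

```python
# Single-node stand-in for the MLlib GBT baseline.
# Assumptions: scikit-learn instead of Spark MLlib; synthetic 28-feature
# data instead of HIGGS; toy labels from a noisy linear combination.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, n_features = 2000, 28
X = rng.normal(size=(n, n_features))
# Toy signal/background split: label depends on the first 7 features + noise
y = (X[:, :7].sum(axis=1) + rng.normal(scale=2.0, size=n) > 0).astype(int)

clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
clf.fit(X[:1500], y[:1500])
auc = roc_auc_score(y[1500:], clf.predict_proba(X[1500:])[:, 1])
```

The "doesn't scale" note in the table is the interesting part: boosting is inherently sequential across trees, so adding cores helps little.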
-
Summary
● Applied ML course for Ph.D. students focusing on distributed training of ML models
● Infrastructure runs on 'opportunistic' resources
● Architecture can be 'reused' on OCCAM
14
-
Spares
15
-
Farm Kube Operator
16
https://github.com/svallero/farmcontroller
● The Spark Driver deploys executor Pods with a given namespace/label/name (let's call this triad a selector)
● But a Pod is not a scalable Kubernetes Resource (whereas a Deployment is)
● The Farm Operator implements two Custom Resource Definitions (CRDs), each with its own Controller:
○ Farm Resource
○ FarmManager Resource
● The Farm Operator can be applied to any other app (farm type) with similar features
● CAVEAT:
○ The farmed app should be resilient to the live removal of executors (e.g. Spark, HTCondor)
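The selector triad can be illustrated with a toy filter. Pods are represented here as plain dicts rather than real Kubernetes objects, and the field names are illustrative:

```python
# Toy illustration of the namespace/label "selector" used to collect
# executor Pods. Pods are plain dicts, not real Kubernetes API objects.
def select_pods(pods, namespace, label_key, label_value):
    return [
        p for p in pods
        if p["namespace"] == namespace
        and p["labels"].get(label_key) == label_value
    ]
```

The real Operator applies the same idea through the Kubernetes API, watching only the Pods that match its Farm's selector.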
-
Farm Kube Operator (continued)
17
Farm Resource
● Collects Pods with a given selector
● Implements scale-down
● Defines a Min number of executors (quota)
● Reconciles on selected Pod events

FarmManager Resource
● Reconciles on Farm events
● Scales down Farms over quota only if some other Farm requests resources and is below its quota
● Simple algorithm: the number of killed Pods per Farm is proportional to the number of Pods over the quota (should be improved)
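The proportional rule can be sketched as a small helper. This is my own reconstruction of the algorithm described above, not the Operator's actual code; the farm representation and names are assumptions:

```python
# Sketch of the FarmManager's proportional scale-down rule.
# Assumptions: each farm is (name, running_pods, quota); `to_free` is the
# number of Pods an under-quota farm is requesting.
def pods_to_kill(farms, to_free):
    # Only farms above their quota can lose Pods
    excess = {name: max(running - quota, 0) for name, running, quota in farms}
    total_excess = sum(excess.values())
    if total_excess == 0:
        return {name: 0 for name in excess}
    # Kill proportionally to each farm's excess, never dipping below quota
    return {
        name: min(excess[name], round(to_free * excess[name] / total_excess))
        for name in excess
    }
```

E.g. with farm "a" at 10 Pods against a quota of 4 and farm "b" exactly at quota, a request for 3 Pods is served entirely from "a".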
-
OCCAM
18
HPC facility at University of Turin:
● managed using container-based, cloud-like technologies
● computing applications are run on Virtual Clusters deployed on top of the physical infrastructure
-
OCCAM SPECS
19
2 Management nodes
● CPU: 2x Intel Xeon E5-2640 v3, 8 cores, 2.6 GHz
● RAM: 64 GB / 2133 MHz
● DISK: 2x 1 TB HDD, RAID0
● NET: IB 56 Gb/s + 2x 10 Gb/s + 4x 1 Gb/s
● FORM FACTOR: 1U

32 Light nodes
● CPU: 2x Intel Xeon E5-2680 v3, 12 cores, 2.5 GHz
● RAM: 128 GB / 2133 MHz (8 x 16 GB)
● DISK: 400 GB SATA SSD, 1.8 in.
● NET: IB 56 Gb/s + 2x 10 Gb/s
● FORM FACTOR: high density (4 nodes per RU)

4 Fat nodes
● CPU: 4x Intel Xeon E7-4830 v3, 12 cores, 2.1 GHz
● RAM: 768 GB / 1666 MHz (48 x 16 GB) DDR4
● DISK: 1x 800 GB SSD + 1x 2 TB HDD, 7200 rpm
● NET: IB 56 Gb/s + 2x 10 Gb/s

4 GPU nodes
● CPU: 2x Intel Xeon E5-2680 v3, 12 cores, 2.1 GHz
● RAM: 128 GB / 2133 MHz (8 x 16 GB) DDR4
● DISK: 1x 800 GB SAS SSD, 6 Gb/s, 2.5"
● NET: IB 56 Gb/s + 2x 10 Gb/s
● GPU: 2x NVIDIA K40 on PCI-E Gen3 x16
-
Scaling tests #1
● Optimize #cores per executor
● Model: MLlib MPC and GBT, 1M events
● One machine (t2-mlwn-01.to.infn.it)
● In the 'literature', #cores = 5 is the magic number to achieve maximum HDFS throughput
20
● #cores = 5 is optimal
● GBT does not scale well; expected, since GBT training is hard to parallelise
-
Scaling tests #2
21
● Optimize #executors
● Model: MLlib MPC, 1M and 10M events
● #cores/executor = 5
● One machine
-
Scaling tests #3
● Scaling on homogeneous resources
○ bare metal, 4 machines with 56 cores and 260 GB RAM each
22
-
ML models & lessons learned 2
23
[Plots: GBT fast, MPC; GBT slow, Keras Sequential]