Delivering a machine learning course on HPC resources

  • Delivering a machine learning course on HPC resources

    Stefano Bagnasco, Federica Legger, Sara Vallero

    This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement LHCBIGDATA No 799062

  • The course

    ● Title: Big Data Science and Machine Learning
    ● Graduate Program in Physics at the University of Torino
    ● Academic year 2018-2019:
      ○ Starts in 2 weeks
      ○ 2 CFU, 10 hours (theory + hands-on)
      ○ 7 registered students
    ● Academic year 2019-2020:
      ○ March 2020
      ○ 4 CFU, 16 hours (theory + hands-on)
      ○ 2 students already registered


  • The Program

    ● Introduction to big data science
      ○ The big data pipeline: state-of-the-art tools and technologies
    ● ML and DL methods:
      ○ supervised and unsupervised models
      ○ neural networks
    ● Introduction to computer architecture and parallel computing patterns
      ○ Introduction to OpenMP and MPI (2019-2020)
    ● Parallelisation of ML algorithms on distributed resources
      ○ ML applications on distributed architectures
      ○ Beyond CPUs: GPUs, FPGAs (2019-2020)


  • The aim

    ● Applied ML course:
      ○ Many courses on advanced statistical methods are available elsewhere
      ○ Focus on hands-on sessions
    ● Students will
      ○ Familiarise with:
        ■ ML methods and libraries
        ■ Analysis tools
        ■ Collaborative models
        ■ Container and cloud technologies
      ○ Learn how to:
        ■ Optimise ML models
        ■ Tune distributed training
        ■ Work with available resources


  • Hands-on

    ● Python with Jupyter notebooks
    ● Prerequisites: some familiarity with numpy and pandas
    ● ML libraries (a minimal sketch follows after this list):
      ○ Day 2: MLlib
        ■ Gradient-Boosted Trees (GBT)
        ■ Multilayer Perceptron Classifier (MPC)
      ○ Day 3: Keras
        ■ Sequential model
      ○ Day 4: BigDL
        ■ Sequential model
    ● Coming:
      ○ CUDA
      ○ MPI
      ○ OpenMP
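    To give a flavour of the Day 2 exercise, here is a minimal MLlib sketch; the data path, column names and layer sizes are illustrative assumptions, not the actual course notebook.

```python
# Minimal Day-2-style MLlib sketch (paths, column names and layer sizes are assumptions).
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier, MultilayerPerceptronClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("mlcourse-day2").getOrCreate()

# Assume a DataFrame with a 'features' vector column and a binary 'label' column.
data = spark.read.parquet("hdfs:///datasets/higgs_features.parquet")  # hypothetical path
train, test = data.randomSplit([0.8, 0.2], seed=42)

gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=50)
mpc = MultilayerPerceptronClassifier(labelCol="label", featuresCol="features",
                                     layers=[28, 30, 30, 2])  # 28 inputs, hidden layers of 30, 2 classes

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
for name, estimator in [("GBT", gbt), ("MPC", mpc)]:
    model = estimator.fit(train)
    print(name, "AUC =", evaluator.evaluate(model.transform(test)))
```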


  • ML Input Dataset for hands-on

    ● Open HEP dataset @ UCI, 7 GB (.csv); a loading sketch follows below
    ● Signal (heavy Higgs) + background (ttbar)
    ● 10M MC events (balanced, 50%:50%)
      ○ 21 low-level features
        ■ pt's, angles, MET, b-tag, …
      ○ 7 high-level features
        ■ Invariant masses (m(jj), m(jjj), …)
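    Loading this dataset into Spark might look like the following sketch; the HDFS path and column names are assumptions (the CSV has a label column followed by the 28 features).

```python
# Sketch: loading the UCI HIGGS csv into Spark (label + 21 low-level + 7 high-level features).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("higgs-dataset").getOrCreate()

columns = ["label"] + [f"f{i}" for i in range(28)]  # hypothetical column names

df = (spark.read
      .csv("hdfs:///datasets/HIGGS.csv", inferSchema=True)  # hypothetical path
      .toDF(*columns))

# Pack the 28 feature columns into the single vector column MLlib expects.
assembler = VectorAssembler(inputCols=columns[1:], outputCol="features")
data = assembler.transform(df).select("label", "features")
data.groupBy("label").count().show()  # roughly balanced signal/background
```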


    [Figure: signal (heavy Higgs) vs. background (ttbar) diagrams]

    Baldi, Sadowski, and Whiteson, "Searching for Exotic Particles in High-Energy Physics with Deep Learning," Nature Communications 5 (2014)

    https://archive.ics.uci.edu/ml/datasets/HIGGS

  • Infrastructure: requirements


    ● Commodity hardware (CPUs)
    ● Non-dedicated and heterogeneous resources:
      ○ Bare metal
        ■ 1 x 24 cores, 190 GB RAM
        ■ 4 x 28 cores, 260 GB RAM
      ○ IaaS Cloud (on premises)
        ■ 10 VMs, 8 cores, 70 GB RAM each
    ● Uniform application/service orchestration layer -> Kubernetes
    ● High-throughput vs. high-performance -> Spark (a configuration sketch follows below)
    ● Distributed datasets -> HDFS
    ● Elasticity: allow scaling up if there are unused resources
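    A hedged sketch of how these pieces are typically wired together with Spark on Kubernetes; all hostnames, image names and resource sizes are placeholders, not the actual deployment.

```python
# Sketch: a Spark session targeting a Kubernetes cluster, reading data from HDFS.
# All hostnames, image names and sizes below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("mlcourse")
         .master("k8s://https://kube-api.example.org:6443")          # Kubernetes control plane
         .config("spark.kubernetes.container.image", "example/spark-py:latest")
         .config("spark.kubernetes.namespace", "mlcourse")
         .config("spark.executor.instances", "10")                   # requested executors
         .config("spark.executor.cores", "5")
         .config("spark.executor.memory", "8g")
         .getOrCreate())

df = spark.read.csv("hdfs://namenode:8020/datasets/HIGGS.csv", inferSchema=True)
```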

  • What about HPCs?


    ● HPC = high-performance processors + low-latency interconnect

    ● HPC clusters are typically managed with a batch system

    ● The OCCAM HPC facility at the University of Torino employs a cloud-like management strategy coupled to lightweight virtualization
      ○ https://c3s.unito.it/index.php/super-computer

  • The OCCAM supercomputer


    Cloud-like HPC cluster

    APPLICATION: defined by
    ● Runtime environment
    ● Resource requirements
    ● Execution model

    VIRTUALIZATION: the pivotal technologies for the middleware architecture are Linux containers, currently managed with Docker.
    ● Package, ship and run distributed application components with guaranteed platform parity across different environments
    ● Democratizing virtualization by providing it to developers in a usable, application-focused form

    COMPUTING MODEL
    ● HPC: batch-like, multi-node workloads using MPI and inter-node communication
    ● PIPELINES: multi-step data analysis requiring high-memory, large single-image nodes
    ● VIRTUAL WORKSTATION: code execution (e.g. R or ROOT) in a single multicore node, possibly with GPU acceleration
    ● JUPYTER-HUB: with a Spark backend for ML and Big Data workloads

    ON DEMAND: autoscaling

  • Infrastructure: architecture

    [Architecture diagram: OAuth login; a Kubernetes control plane and Kubernetes workers spanning high-class hardware, lower-class hardware and virtual machines; several Spark drivers, each with its own set of Spark executors; HDFS for the datasets]

    Total resources: 216 CPUs, 1.9 TB memory, 2.3 TB HDFS, 1 Gbps network

  • Infrastructure: elasticity

    [Diagram: the Farm Operator scales each Spark driver's pool of executors up and down]

    ● The Spark driver continuously scales up to reach the requested number of executors
    ● No static quotas are enforced, but a minimum number of executors is guaranteed to each tenant
    ● Custom Kubernetes Operator (alpha version), sketched below:
      ○ lets tenants occupy all available resources in a FIFO manner
      ○ undeploys excess executors only to guarantee the minimum number of resources to all registered tenants
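    A simplified sketch of the bookkeeping behind such a scale-down decision; this is not the actual farmcontroller code, and the tenant label and quotas are made up.

```python
# Simplified sketch of the scale-down bookkeeping (hypothetical tenant label and quotas).
from kubernetes import client, config

MIN_EXECUTORS = {"tenant-a": 4, "tenant-b": 4}   # hypothetical per-tenant minimum

config.load_kube_config()
core = client.CoreV1Api()

def executors_per_tenant(namespace="spark"):
    """Count running executor pods per tenant, using the executor label Spark sets."""
    counts = {}
    pods = core.list_namespaced_pod(namespace, label_selector="spark-role=executor")
    for pod in pods.items:
        tenant = (pod.metadata.labels or {}).get("tenant", "unknown")  # hypothetical label
        counts[tenant] = counts.get(tenant, 0) + 1
    return counts

counts = executors_per_tenant()
starved = [t for t, minimum in MIN_EXECUTORS.items() if counts.get(t, 0) < minimum]
print("Tenants below their guaranteed minimum:", starved)
```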

  • Scaling tests

    ● #cores per executor
    ● #cores per machine
    ● #cores in a homogeneous cluster
    ● Strong scaling efficiency = time(1) / (N * time(N)), where N = #cores (see the sketch below)
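    In other words, with time(1) the runtime on one core and time(N) the runtime on N cores:

```python
# Strong scaling efficiency as defined above: 1.0 means perfect scaling.
def strong_scaling_efficiency(time_1, time_n, n_cores):
    return time_1 / (n_cores * time_n)

# Made-up example: 100 s on 1 core, 30 s on 4 cores -> efficiency ~0.83
print(strong_scaling_efficiency(100.0, 30.0, 4))
```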

    [Plots: strong scaling efficiency for MLlib GBT, MLlib MPC and BigDL NN, compared to perfect scaling]

  • ML models & lessons learned

    Model                                         | AUC | Time     | # events | Cores | Note
    MLlib GBT                                     | 82  | 15 m     | 10M      | 25    | Doesn't scale
    MLlib MPC (4 layers, 30 hidden units)         | 74  | 9 m      | 10M      | 25    | Scales well, but can't build complex models
    Keras Sequential (1 layer, 100 hidden units)  | 81  | 18 m     | 1M       | 25    | No distributed training, cannot process 10M events
    BigDL Sequential (2 layers, 300 hidden units) | 86  | 3 h 15 m | 10M      | 88    | 1 core/executor required
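    For context, the Keras row corresponds roughly to a model of this shape; this is a sketch under assumptions (optimizer, activations), not the actual notebook.

```python
# Sketch of the Keras row in the table: a Sequential model with one hidden layer
# of 100 units on the 28 HIGGS features (hyperparameters are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(100, activation="relu", input_shape=(28,)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.summary()
```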

  • Summary

    ● Applied ML course for Ph.D. students focusing on distributed training of ML models

    ● Infrastructure runs on ‘opportunistic’ resources

    ● Architecture can be ‘reused’ on OCCAM


  • Spares


  • Farm Kube Operator


    https://github.com/svallero/farmcontroller

    ● Spark Driver deploys executor Pods with given namespace/label/name (let’s call this triad a selector)

    ● But a Pod is not a scalable Kubernetes resource (whereas a Deployment is)

    ● The Farm Operator implements two Custom Resource Definitions (CRDs), each with its own Controller:
      ○ Farm Resource
      ○ FarmManager Resource

    ● The Farm Operator can be applied to any other app (farm type) with similar features

    ● CAVEAT:
      ○ The Farm app should be resilient to the live removal of executors (e.g. Spark, HTCondor)


  • Farm Kube Operator (continued)


    Farm Resource
    ● Collects Pods with a given selector
    ● Implements scale-down
    ● Defines a minimum number of executors (quota)
    ● Reconciles on selected Pod events

    FarmManager Resource
    ● Reconciles on Farm events
    ● Scales down Farms over quota only if some other Farm requests resources and is below its quota
    ● Simple algorithm: the number of killed Pods per Farm is proportional to the number of Pods over the quota (should be improved); see the sketch below
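    The proportional rule can be sketched in a few lines (the numbers are illustrative):

```python
# Sketch of the proportional scale-down rule: each Farm over quota gives up
# Pods in proportion to how far over quota it is (numbers are illustrative).
def pods_to_kill(over_quota, n_needed):
    total_over = sum(over_quota.values())
    return {farm: round(n_needed * excess / total_over)
            for farm, excess in over_quota.items()}

# Two Farms are 6 and 2 Pods over quota; another tenant needs 4 executors.
print(pods_to_kill({"farm-a": 6, "farm-b": 2}, n_needed=4))
# -> {'farm-a': 3, 'farm-b': 1}
```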


  • OCCAM

    HPC facility at the University of Turin

    ● Managed using container-based, cloud-like technologies
    ● Computing applications are run on Virtual Clusters deployed on top of the physical infrastructure

  • OCCAM SPECS

    2 Management nodes
    ● CPU: 2x Intel Xeon E5-2640 v3, 8 cores, 2.6 GHz
    ● RAM: 64 GB / 2133 MHz
    ● Disk: 2x 1 TB HDD, RAID0
    ● Net: IB 56 Gb/s + 2x 10 Gb/s + 4x 1 Gb/s
    ● Form factor: 1U

    32 Light nodes
    ● CPU: 2x Intel Xeon E5-2680 v3, 12 cores, 2.5 GHz
    ● RAM: 128 GB / 2133 MHz (8x 16 GB)
    ● Disk: 400 GB SATA SSD, 1.8 inch
    ● Net: IB 56 Gb/s + 2x 10 Gb/s
    ● Form factor: high density (4 nodes per RU)

    4 Fat nodes
    ● CPU: 4x Intel Xeon E7-4830 v3, 12 cores, 2.1 GHz
    ● RAM: 768 GB / 1666 MHz (48x 16 GB) DDR4
    ● Disk: 1x 800 GB SSD + 1x 2 TB HDD, 7200 rpm
    ● Net: IB 56 Gb/s + 2x 10 Gb/s

    4 GPU nodes
    ● CPU: 2x Intel Xeon E5-2680 v3, 12 cores, 2.1 GHz
    ● RAM: 128 GB / 2133 MHz (8x 16 GB) DDR4
    ● Disk: 1x 800 GB SAS SSD, 6 Gbps, 2.5 inch
    ● Net: IB 56 Gb/s + 2x 10 Gb/s
    ● GPU: 2x NVIDIA K40 on PCIe Gen3 x16

  • Scaling tests #1

    ● Optimize #cores per executor
    ● Model: MLlib MPC and GBT, 1M events
    ● One machine: t2-mlwn-01.to.infn.it
    ● In the ‘literature’, #cores = 5 is the magic number to achieve maximum HDFS throughput

    Findings:
    ● #cores = 5 is optimal
    ● GBT does not scale well; expected, since GBT training is hard to parallelise

  • Scaling tests #2

    ● Optimize #executors
    ● Model: MLlib MPC, 1M and 10M events
    ● #cores/executor = 5
    ● One machine

  • Scaling tests #3

    ● Scaling on homogeneous resources
      ○ bare metal, 4 machines with 56 cores and 260 GB RAM

  • ML models & lessons learned 2


    [Plots: GBT fast vs. MPC; GBT slow vs. Keras Sequential]