Transcript of Mellanox Smart Interconnect and Roadmap · 2020. 1. 16. · High Performance Interconnect...

  • Slide 1

    HPC-AI Advisory Council – Perth, WA
    Ashrut Ambastha

    Mellanox Smart Interconnect and Roadmap

  • Slide 2

    InfiniBand Accelerates 6 of Top 10 Supercomputers

    (Systems ranked #1, #2, #3, #5, #8 and #10)

  • Slide 3

    HDR 200G InfiniBand Wins Next Generation Supercomputers

    1.7 Petaflops, 2K HDR InfiniBand Nodes, Dragonfly+ Topology

    23.5 Petaflops, 8K HDR InfiniBand Nodes, Fat-Tree Topology

    3.1 Petaflops, 1.8K HDR InfiniBand Nodes, Fat-Tree Topology

    1.6 Petaflops, HDR InfiniBand, Hybrid CPU-GPU-FPGA, Fat-Tree Topology

  • Slide 4

    High Performance Interconnect Development

    1995  2000  2005  2010  2015  2020  2025

    InfiniBand: SDR, DDR, QDR, FDR, EDR, HDR, NDR, XDR

    Crossbar, Seastar, Gemini, Aries, Slingshot

    Myrinet

    QsNet, QsNet with Gateway to Ethernet

    InfiniPath, TrueScale, OmniPath

    First Petaflop Supercomputer

    LANL Roadrunner

    IBM / Mellanox InfiniBand

    First Teraflop Supercomputer

    Sandia ASCI Red

    Intel

  • Slide 5

    Accelerating All Levels of HPC/AI Frameworks

    Network Framework
    ▪ Software Defined Virtual Devices
    ▪ Network Transport Offload
    ▪ RDMA
    ▪ GPU-Direct
    ▪ SHIELD (self-healing network)

    Communication Framework
    ▪ SHARP
    ▪ MPI Tag Matching
    ▪ MPI Rendezvous

    Application Framework
    ▪ Data Analysis
    ▪ Configurable Logic

  • Slide 6

    In-Network Computing to Enable Data-Centric Data Centers

    Diagram: CPU and GPU nodes connected through the fabric with GPUDirect RDMA, the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), and NVMe over Fabrics

    Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

    Mellanox In-Network Computing and Acceleration Engines

  • Slide 7

    Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

  • Slide 8

    SHARP Allreduce Performance Advantages

    SHARP enables 75% Reduction in Latency

    Providing Scalable Flat Latency


  • Slide 9

    Oak Ridge National Laboratory – Coral Summit Supercomputer

    SHARP AllReduce Performance Advantages

    ▪ 2K nodes, MPI All Reduce latency, 2KB message size

    SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) Enables Highest Performance
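
    For reference, here is a minimal sketch (using mpi4py, not the benchmark actually run on Summit) of the kind of AllReduce latency measurement these results describe. With a SHARP-capable MPI stack such as HPC-X, the reduction is offloaded to the switch fabric without any change to code like this; the exact enablement settings are library-specific and treated as an assumption here.

```python
# Minimal AllReduce latency sketch (illustrative only, not the Summit benchmark).
# A SHARP-enabled MPI library offloads the reduction to the switches
# transparently to this code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
iters, warmup = 1000, 100
send = np.ones(2048 // 8, dtype=np.float64)   # 2KB message, as in the slide
recv = np.empty_like(send)

for _ in range(warmup):
    comm.Allreduce(send, recv, op=MPI.SUM)

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(iters):
    comm.Allreduce(send, recv, op=MPI.SUM)
elapsed = MPI.Wtime() - t0

if comm.rank == 0:
    print(f"avg AllReduce latency over {comm.size} ranks: "
          f"{elapsed / iters * 1e6:.2f} us")
```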

  • Slide 10

    SHARP AllReduce Performance Advantages 1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology

    SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) Enables Highest Performance

  • Slide 11

    SHARP Accelerates AI Performance

    ▪ The CPU in a parameter server becomes the bottleneck
    ▪ SHARP performs the gradient averaging in the network, replacing all physical parameter servers and accelerating AI performance
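
    To make the offload concrete, here is a minimal sketch (assuming mpi4py and NumPy; illustrative, not Mellanox's implementation) of data-parallel gradient averaging expressed as an AllReduce, the operation SHARP executes inside the switches instead of on a parameter server's CPU.

```python
# Gradient averaging as an AllReduce: sum across workers, then divide by the
# worker count. This is the step a parameter server would otherwise do on its
# CPU and that SHARP can perform in the network.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Hypothetical local gradients from this worker's mini-batch (~100 MB of float32)
local_grads = np.random.rand(25_000_000).astype(np.float32)
avg_grads = np.empty_like(local_grads)

comm.Allreduce(local_grads, avg_grads, op=MPI.SUM)
avg_grads /= comm.size   # every worker now holds the same averaged gradients
```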

  • Slide 12

    SHARP Performance Advantage for AI

    ▪ SHARP provides a 16% performance increase for deep learning (initial results)
    ▪ TensorFlow with Horovod running the ResNet-50 benchmark over HDR InfiniBand (ConnectX-6, Quantum)
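
    For orientation, here is a minimal Horovod + TensorFlow sketch in the spirit of the ResNet-50 run mentioned above (illustrative; the actual benchmark and its configuration are not reproduced here). Horovod's gradient allreduce is the traffic that SHARP can offload on a SHARP-enabled InfiniBand fabric.

```python
# Minimal Horovod + TensorFlow data-parallel training sketch (synthetic data
# stands in for the real ImageNet pipeline used in the benchmark).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss="categorical_crossentropy", optimizer=opt)

# Synthetic batch in place of ImageNet input
x = tf.random.uniform((32, 224, 224, 3))
y = tf.one_hot(tf.random.uniform((32,), maxval=1000, dtype=tf.int32), 1000)

model.fit(x, y, epochs=1,
          callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
          verbose=1 if hvd.rank() == 0 else 0)
```

    A run like this would typically be launched with horovodrun or mpirun across the GPU nodes.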

  • Slide 13

    Mellanox Accelerates Record-Breaking AI Systems

    NVIDIA DGX SATURNV
    ▪ 124 DGX-1 nodes interconnected by 32 L1 TOR switches, in 2016

    ▪ Mellanox 36 port EDR L1 and L2 switches, 4 EDR per system

    ▪ Upgraded to 660 NVIDIA DGX-1 V100 Server Nodes, in 2017

    ▪ 5280 V100 GPUs, 660 PetaFLOPS (AI)

    ImageNet training record breakers:
    ▪ V100 x 1088, EDR InfiniBand – 91.62% scaling efficiency
    ▪ P100 x 256, EDR InfiniBand – ~90% scaling efficiency
    ▪ P100 x 1024, FDR InfiniBand – 80% scaling efficiency

  • Slide 14

    Mellanox Accelerates Record-Breaking AI Systems

    Faster Speed InfiniBand Enabled Superior Scaling for Top-Level AI systems

    Chart: GPU Scaling Efficiency (ResNet-50) – higher is better
    ▪ Preferred Networks: FDR InfiniBand + Chainer, Tesla P100 x 1024 – 80%
    ▪ Sony: EDR InfiniBand x 2 + Sony NNL, Tesla V100 x 1088 – 91.62% (11.6% higher efficiency)

    Chart: Training Time (ResNet-50) – lower is better
    ▪ Preferred Networks: FDR InfiniBand + Chainer, Tesla P100 x 1024 – 900 s
    ▪ Sony: EDR InfiniBand x 2 + Sony NNL, Tesla V100 x 1088 – 291 s (3.1x higher performance)

    Sony broke the ImageNet training record on the AI Bridging Cloud Infrastructure (ABCI) cluster with a 2D-Torus GPU topology in Dec. 2018. Nodes are connected with two rails of EDR InfiniBand.

    https://nnabla.org/paper/imagenet_in_224sec.pdf

  • Slide 15

    Adaptive Routing (AR) Performance – ORNL Summit

    ▪ Oak Ridge National Laboratory – Coral Summit supercomputer
    ▪ Bisection bandwidth benchmark, based on mpiGraph
    ▪ Explores the bandwidth between possible MPI process pairs

    ▪ AR results demonstrate an average performance of 96% of the maximum bandwidth measured

    mpiGraph explores the bandwidth between possible MPI process pairs. In the histograms, the single cluster with AR indicates that all pairs achieve nearly maximum bandwidth while single-path static routing has nine clusters as congestion limits bandwidth, negatively impacting overall application performance.
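
    A minimal sketch of this kind of pairwise bandwidth measurement (using mpi4py; not mpiGraph itself): under adaptive routing the per-pair results cluster near the link maximum, while static routing spreads them out when paths collide.

```python
# Measure point-to-point bandwidth between shifted pairs of MPI ranks,
# loosely in the spirit of mpiGraph (illustrative sketch only).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size
buf_bytes = 1 << 20                        # 1 MiB messages
sbuf = np.zeros(buf_bytes, dtype=np.uint8)
rbuf = np.empty_like(sbuf)
iters = 20

for shift in range(1, size):               # every rank pairs with every other rank
    dst = (rank + shift) % size
    src = (rank - shift) % size
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(iters):
        comm.Sendrecv(sbuf, dest=dst, recvbuf=rbuf, source=src)
    dt = MPI.Wtime() - t0
    bw = buf_bytes * iters / dt / 1e9      # approx. GB/s sent to this partner
    print(f"rank {rank} -> {dst}: {bw:.2f} GB/s")
```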

  • Slide 16

    InfiniBand Congestion Control (2010)

    Without Congestion Control: congestion – throughput loss
    With Congestion Control: no congestion – highest throughput!

  • Slide 17

    For the HPC-AI Cloud Enthusiasts

  • Slide 18

    It has been around for a while….

  • Slide 19

    Introducing Mellanox BlueField SmartNIC

    Programmability

    Performance

    Isolation

  • Slide 20

    BlueField-2 SoC Block Diagram

    ▪ Tile architecture running 8 x Arm® A72 CPUs
    ▪ SkyMesh™ coherent low-latency interconnect
    ▪ 6MB L3 last-level cache
    ▪ Arm frequency: 2GHz – 2.5GHz
    ▪ Up to 256GB DDR4 @ 3200MT/s w/ ECC

    ▪ Dual 10–100Gb/s ports or single 200Gb/s
    ▪ 50Gb/s PAM4 SerDes

    ▪ Supports both Ethernet and InfiniBand

    ▪ ConnectX-6 Dx controller

    ▪ 1GbE out-of-band management port

    ▪ Fully integrated PCIe switch: PCIe Gen 4.0, 16 lanes, Root Complex or Endpoint

    Block diagram: ConnectX-6 Dx subsystem with packet processors, eSwitch flow steering/switching, RDMA transport, IPsec/TLS encrypt/decrypt, and application offloads (NVMe-oF, T10-DIF, etc.); 8 x A72 cores with L2 caches, a 6MB L3 cache and DDR4 64b+8b @ 3200MT/s; PCIe Gen 4.0 switch; security engines (RNG, PubKey, Secure Boot); accelerators (Deflate/Inflate, SHA-2 de-dup, Regular Expression, GACC/DMA); I2C, USB, DAP, UART, eMMC, GPIO; dual VPI ports (InfiniBand/Ethernet: 1, 10, 25, 50, 100, 200G); 1GbE out-of-band management port

  • Slide 21

    Functional Isolation with BlueField-2

    ▪ A computer in front of the computer

    ▪ Isolation and offload
      ▪ Infrastructure functions fully implemented in the SmartNIC
      ▪ Networking, security and storage

    ▪ Functionality runs securely in a separate trust domain
      ▪ Enforces policies on a compromised host
      ▪ Host access to the SmartNIC can be blocked by hardware

    Diagram: host workloads (VM, container, bare metal, SR-IOV VM) sit above the network interface, while the Arm side of the SmartNIC runs OVS/OVS-DPDK, the ovsdb server, controller and Neutron agent over the eSwitch and hardware tables; the control plane and management port remain isolated from the host

  • Slide 22

    BlueField Enables SDN in Bare-Metal Clouds

    TOR Switch Networking:
    ▪ Limited to no SDN capabilities
    ▪ Orchestration through proprietary TOR switch vendor plugins
    ▪ Mandates proprietary network driver installation in the bare-metal host

    SDN Integration with BlueField:
    ✓ Full-featured SDN capabilities
    ✓ Full orchestration through upstream OpenStack Neutron APIs
    ✓ No installation of a network driver in the bare-metal host

    Diagram: left – bare-metal host with a plain NIC behind a TOR switch, orchestrated by OpenStack Neutron through a Mellanox ML2 plugin; right – bare-metal host with a Mellanox BlueField SmartNIC running the Neutron OVS L2 agent and OVS, presenting a VirtIO network interface; the tenant's domain stays on the host while the provider's domain stays on the SmartNIC

  • Slide 23

    Software-defined Network Accelerated Processing

    Mellanox BlueField NVMe SNAP

    ▪ NVMe SNAP exposes an NVMe interface as a physical NVMe SSD
      ▪ Implements NVMe-oF in the SmartNIC – no NVMe-oF driver required on the bare-metal host
      ▪ Leverages the standard NVMe driver, which is available on all major OSs

    ▪ Solves cloud storage pain points
      ▪ OS agnostic
      ▪ Near-local performance
      ▪ Secured, locked-down solution
      ▪ Boot from (remote) disk
      ▪ Any Ethernet/InfiniBand wire protocol (NVMe-oF, iSER, iSCSI, proprietary, etc.)

    ▪ NVMe SNAP + SmartNIC as 2 in 1
      ▪ Serves as both a smart network adapter and an emulation of local storage
      ▪ 2-in-1 bundle saves even more on CAPEX

  • Slide 24

    BlueField Enables Storage Virtualization in Bare-Metal Clouds

    Local Physical Drive in Bare-metal Host:
    ▪ Bound by physical storage capacity
    ▪ No backup service, or limited to local RAID
    ▪ No possibility to manage storage resources
    ▪ No migration of resources

    NVMe SNAP Emulation:
    ✓ Same flexibility as virtualized storage
    ✓ Same performance as local storage
    ✓ OS agnostic, only the NVMe driver required
    ✓ Backed up in the storage cloud
    ✓ Dynamically allocated cloud storage
    ✓ Any wire protocol and storage management

    Diagram: left – bare-metal host with a plain NIC and a local physical drive behind the TOR switch; right – bare-metal host with a Mellanox BlueField SmartNIC that presents emulated storage to the host and runs the storage initiator to remote storage, keeping the tenant's domain separate from the provider's domain

  • Slide 25

    Delivering Highest Performance and Scalability

    ▪ Scalable, intelligent, flexible, high performance, end-to-end connectivity solutions

    ▪ Standards-based, supported by large eco-system

    ▪ Supports all compute architectures: x86, Power, ARM, GPU, FPGA etc.

    ▪ Offloading and In-Network Computing architecture

    ▪ Flexible topologies: Fat Tree, Mesh, 3D Torus, Dragonfly+, etc.

    ▪ Converged I/O: compute, storage, management on single fabric

    ▪ Backward and future compatible

    The Future Depends On Smart Interconnect

  • Slide 26

    Thank You