SR-IOV: Performance Benefits for Virtualized Interconnects


Transcript of SR-IOV: Performance Benefits for Virtualized Interconnects

Page 1: SR-IOV: Performance Benefits for Virtualized Interconnects

SAN DIEGO SUPERCOMPUTER CENTER

SR-IOV: Performance Benefits for Virtualized Interconnects

Glenn K. Lockwood, Mahidhar Tatineni, Rick Wagner

July 15, XSEDE14, Atlanta

Page 2: SR-IOV: Performance Benefits for Virtualized Interconnects

Background

•  High Performance Computing (HPC) is reaching beyond traditional application areas, creating a need for increased flexibility from HPC cyberinfrastructure.

•  Increasingly, jobs on XSEDE resources originate from web portals and science gateways.

•  Science gateways and user communities can develop virtual "compute appliances" containing tightly integrated application stacks that can be deployed in a hardware-agnostic fashion.

•  Several vendors also package software in appliances to enable easy deployment.

•  Such benefits have been an important driving force behind compute clouds such as Amazon Web Services (AWS) EC2.

Page 3: SR-IOV: Performance Benefits for Virtualized Interconnects

Background

•  The network bandwidth and latency of virtualized systems have traditionally been markedly worse than those of the native hardware.

•  Hardware vendors have been providing increased support for virtualization within hardware, such as I/O memory management units (IOMMUs).

•  With the standardization and adoption of technologies such as Single Root I/O Virtualization (SR-IOV) in network device hardware, a road has been paved toward truly high-performance virtualization for high-performance computing applications.

•  These technologies will be available to XSEDE users via future computing resources (Comet at SDSC: production starting early 2015).

Page 4: SR-IOV: Performance Benefits for Virtualized Interconnects

Single Root I/O Virtualization in HPC

•  Problem: complex workflows demand increasing flexibility from HPC platforms

•  Virtualization = flexibility

•  Virtualization = I/O performance loss (e.g., excessive DMA interrupts)

•  Solution: SR-IOV and Mellanox ConnectX-3 InfiniBand HCAs

•  One physical function (PF) → multiple virtual functions (VFs), each with its own DMA streams, memory space, and interrupts

•  Allows DMA to bypass the hypervisor and go directly to the VMs

Page 5: SR-IOV: Performance Benefits for Virtualized Interconnects

High-Performance Virtualization on Comet

•  Mellanox FDR InfiniBand HCAs with SR-IOV

•  Rocks to manage high-performance virtual clusters

•  Flexibility to support complex science gateways and web-based workflow engines

•  Custom, user-defined application stacks on virtual clusters

•  Leveraging FutureGrid expertise and experience

•  High-bandwidth filesystem access via virtualized InfiniBand

Page 6: SR-IOV: Performance Benefits for Virtualized Interconnects

Hardware/Software Configurations of Test Clusters

•  Native InfiniBand (SDSC): Rocks 6.1 (EL6); Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; QDR4X InfiniBand, Mellanox ConnectX-3

•  SR-IOV InfiniBand (SDSC): Rocks 6.1 (EL6), KVM hypervisor; Intel Xeon E5-2660 (2.2 GHz), 16 cores/node; 64 GB DDR3; QDR4X InfiniBand, Mellanox ConnectX-3

•  Native 10GbE (SDSC): Rocks 6.1 (EL6); Intel Xeon E5-2670 (2.6 GHz), 16 cores/node; 64 GB DDR3; 10GbE

•  Software-Virtualized 10GbE (EC2): Amazon Linux 2013.09 (EL6), Xen HVM, cc2.8xlarge instance; Intel Xeon E5-2670 (2.6 GHz), 16 cores/node; 60.5 GB DDR3; 10GbE (Xen driver)

•  SR-IOV 10GbE (EC2): Amazon Linux 2013.09 (EL6), Xen HVM, c3.8xlarge instance; Intel Xeon E5-2680v2 (2.8 GHz), 16 cores/node; 60.5 GB DDR3; 10GbE (Intel VF driver)

Page 7: SR-IOV: Performance Benefits for Virtualized Interconnects

Benchmarks

•  Fundamental performance characteristics of the interconnect were evaluated using the OSU Micro-Benchmarks: latency, unidirectional bandwidth, and bidirectional bandwidth tests.

•  WRF: widely used weather modeling application, run in both research and operational forecasting. The CONUS-12km benchmark was used for performance evaluation.

•  Quantum ESPRESSO: application that performs density functional theory (DFT) calculations for condensed matter problems. The DEISA AUSURF112 benchmark was used.
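The OSU micro-benchmarks print a simple two-column table of message size and measured metric, with comment lines prefixed by "#". As a small illustrative sketch (the sample text below mimics the osu_latency output format; the values are made up, not measured), the output can be parsed like this:

```python
def parse_osu_output(text):
    """Parse OSU micro-benchmark output into (message_size, value) pairs.

    Lines starting with '#' are headers/comments; data lines contain the
    message size in bytes and the metric (microseconds for osu_latency,
    MB/s for osu_bw).
    """
    results = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size, value = line.split()
        results.append((int(size), float(value)))
    return results

# Hypothetical osu_latency-style output (values illustrative only):
sample = """\
# OSU MPI Latency Test v3.9
# Size          Latency (us)
0               1.21
1               1.23
2               1.22
"""
pairs = parse_osu_output(sample)
```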

Page 8: SR-IOV: Performance Benefits for Virtualized Interconnects

Benchmark Build Details

•  OSU Micro-Benchmarks (OMB version 3.9) compiled with OpenMPI 1.5 and GCC 4.4.6.

•  Both test applications were built with Intel Composer XE 2013 and all of the options necessary to allow the compiler to generate the 256-bit AVX vector instructions available on all of the testing platforms.

•  OpenMPI 1.5 was used for both the InfiniBand and 10GbE platform tests.

•  Intel's Math Kernel Library (MKL) 11.0 provided all BLAS, LAPACK, ScaLAPACK, and FFTW3 functions where necessary.

Page 9: SR-IOV: Performance Benefits for Virtualized Interconnects

SR-IOV with 10GbE*: Latency Results

MPI point-to-point latency as measured by the osu_latency benchmark. Error bars are +/- three standard deviations from the mean.

•  12-40% improvement under the virtualized environment with SR-IOV

•  Still 2-2.5X slower than the native case, even with SR-IOV

•  SR-IOV provides 3× to 4× less variation in latency for small message sizes

* SR-IOV provided with Amazon's C3 instances
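The error bars are three standard deviations about the mean of repeated measurements. A minimal sketch of that calculation, using hypothetical latency samples (not the measured data):

```python
import statistics

def error_bars(samples, k=3):
    """Return (mean, lower, upper) where the bar spans +/- k sample
    standard deviations about the mean."""
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return mean, mean - k * sigma, mean + k * sigma

# Hypothetical repeated latency measurements, in microseconds:
lat_us = [60.0, 62.0, 58.0, 61.0, 59.0]
mean, lo, hi = error_bars(lat_us)
```

A wider bar thus directly reflects run-to-run latency variation, which is what shrinks 3x to 4x under SR-IOV for small messages.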

Page 10: SR-IOV: Performance Benefits for Virtualized Interconnects

SR-IOV with 10GbE: Bandwidth Results

MPI (a) unidirectional bandwidth and (b) bidirectional bandwidth for the 10GbE interconnect as measured by the osu_bw and osu_bibw benchmarks, respectively.

•  Unidirectional messaging bandwidth never exceeds 500 MB/s (~40% of line speed).

•  Native performance is 1.5-2X faster.

•  Similar results for bidirectional bandwidth; SR-IOV has very little benefit in both cases.

•  SR-IOV helps slightly (13% for random ring, 17% for natural ring) in collective bandwidth tests.

•  Native total ring bandwidth was more than 2X faster than the SR-IOV virtualized results.
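The "~40% of line speed" figure follows from 10GbE's theoretical line rate of 1250 MB/s (10 Gbit/s divided by 8 bits per byte, in decimal megabytes); a quick arithmetic check:

```python
# 10 Gbit/s Ethernet line rate in (decimal) megabytes per second
LINE_RATE_MBPS = 10_000 / 8   # = 1250 MB/s
observed_mbps = 500           # peak unidirectional MPI bandwidth on EC2
fraction = observed_mbps / LINE_RATE_MBPS
print(f"{fraction:.0%} of line speed")  # prints "40% of line speed"
```

Note this ignores Ethernet and TCP/IP framing overhead, so even a perfect stack delivers somewhat less than 1250 MB/s at the MPI level.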

Page 11: SR-IOV: Performance Benefits for Virtualized Interconnects

SR-IOV with InfiniBand: Latency

OSU Micro-Benchmarks (3.9, osu_latency)

•  SR-IOV
   •  < 30% overhead for messages < 128 bytes
   •  < 10% overhead for eager send/recv
   •  Overhead → 0% in the bandwidth-limited regime

•  Amazon EC2
   •  > 5000% worse latency
   •  Time dependent (noisy)

50x less latency than Amazon EC2

Figure 5. MPI point-to-point latency measured by osu_latency for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon (AWS) and non-virtualized 10GbE.

Page 12: SR-IOV: Performance Benefits for Virtualized Interconnects

SR-IOV with InfiniBand: Bandwidth

OSU Micro-Benchmarks (3.9, osu_bw)

•  SR-IOV
   •  < 2% bandwidth loss over the entire range
   •  > 95% of peak bandwidth

•  Amazon EC2
   •  < 35% of peak bandwidth
   •  900% to 2500% worse bandwidth than virtualized InfiniBand

10x more bandwidth than Amazon EC2

Figure 6. MPI point-to-point bandwidth measured by osu_bw for QDR InfiniBand. Included for scale are the analogous 10GbE measurements from Amazon (AWS) and non-virtualized 10GbE.

Page 13: SR-IOV: Performance Benefits for Virtualized Interconnects

Application Benchmarks: WRF

•  WRF CONUS-12km benchmark: the domain has 12 km horizontal resolution on a 425 by 300 grid with 35 vertical levels and a time step of 72 seconds.

•  Run using six nodes (96 cores) over QDR4X InfiniBand virtualized with SR-IOV.

•  The SR-IOV test cluster has 2.2 GHz Intel Xeon E5-2660 processors.

•  The Amazon instances used 2.6 GHz Intel Xeon E5-2670 processors.

Page 14: SR-IOV: Performance Benefits for Virtualized Interconnects

Weather Modeling – 15% Overhead

WRF 3.4.1 – 3-hour forecast

•  96-core (6-node) calculation

•  Nearest-neighbor communication

•  Scalable algorithms

•  SR-IOV incurs a modest (15%) performance hit

•  ...but is still 20% faster*** than Amazon

*** 20% faster despite the SR-IOV cluster having 20% slower CPUs

Page 15: SR-IOV: Performance Benefits for Virtualized Interconnects

Application Benchmarks: Quantum ESPRESSO

•  DEISA AUSURF112 benchmark with Quantum ESPRESSO.

•  Matrix diagonalization is done with the conjugate gradient algorithm. Communication overhead is larger than in the WRF case.

•  3D Fourier transforms stress the interconnect because they perform multiple matrix transpose operations in which every MPI rank must send data to and receive data from every other MPI rank.

•  These global collectives cause interconnect congestion, and efficient 3D FFTs are limited by the bisection bandwidth of the entire fabric connecting all of the compute nodes.
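A back-of-the-envelope model makes the bisection-bandwidth pressure concrete. In a uniform all-to-all, each rank sends an equal share of its data to every other rank, so nearly all of it crosses the network, and half of the total traffic crosses any bisection of the machine. The sketch below uses illustrative numbers, not measurements from the benchmark:

```python
def alltoall_bytes_sent_per_rank(local_bytes, nranks):
    """In a uniform all-to-all, each rank sends local_bytes/nranks to each
    of the (nranks - 1) other ranks, so nearly all its data hits the wire."""
    return local_bytes * (nranks - 1) / nranks

def bisection_traffic_bytes(local_bytes, nranks):
    """Bytes crossing a bisection that splits the ranks into two halves:
    each of the nranks/2 ranks on one side sends local_bytes/nranks to each
    of the nranks/2 ranks on the other side, in both directions."""
    per_peer = local_bytes / nranks
    half = nranks // 2
    return 2 * half * half * per_peer

# Illustrative example: 48 MPI ranks, each holding 100 MB of FFT data
sent = alltoall_bytes_sent_per_rank(100e6, 48)   # ~97.9 MB leaves each rank
cross = bisection_traffic_bytes(100e6, 48)       # 2.4 GB crosses the bisection
```

Because the bisection traffic grows with the total problem size regardless of how the ranks are placed, the transpose is throttled by fabric bisection bandwidth rather than per-link bandwidth.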

Page 16: SR-IOV: Performance Benefits for Virtualized Interconnects

Quantum ESPRESSO: 5x Faster than EC2

Quantum ESPRESSO 5.0.2 – DEISA AUSURF112 benchmark

•  48-core, 3-node calculation

•  CG matrix inversion (irregular communication)

•  3D FFT matrix transposes (all-to-all communication)

•  28% slower with SR-IOV

•  SR-IOV still > 500% faster*** than EC2

*** faster despite the SR-IOV cluster having 20% slower CPUs

Page 17: SR-IOV: Performance Benefits for Virtualized Interconnects

Conclusions

•  SR-IOV is a huge step forward in high-performance virtualization.

•  It shows substantial improvement in latency over Amazon EC2 and has negligible bandwidth overhead.

•  Benchmark application performance confirms this: significant improvement over EC2.

•  SR-IOV lowers the performance barrier to virtualizing the interconnect and makes fully virtualized HPC clusters viable.

•  Comet will deliver virtualized HPC to new and non-traditional communities that need flexibility, without a major loss of performance.