Scaling Results From Isambard: the First...
Transcript of Scaling Results From Isambard: the First...
Scaling Results From Isambard: the First Generation of Arm-based Supercomputers
Prof Simon McIntosh-SmithIsambard PIUniversity of Bristol /GW4 Alliance
Isambard system specification• 10,752 Armv8 cores (168n x 2s x 32c)
• Cavium ThunderX2 32core 2.1à2.5GHz• Cray XC50 ‘Scout’ form factor• High-speed Aries interconnect• Cray HPC optimised software stack
• CCE, Cray MPI, math libraries, CrayPAT, …• Phase 2 (the Arm part):
• Delivered Oct 22nd, handed over Oct 29th
• Accepted Nov 9th
• Upgrade to final B2 TX2 silicon, firmware, CPE completed March 15th 2019
http://gw4.ac.uk/isambard/
Cavium ThunderX2, a seriously beefy CPU• 32 cores at up to 2.5GHz• Each core is 4-way superscalar, Out-of-Order• 32KB L1, 256KB L2 per core• Shared 32MB L3• Dual 128-bit wide NEON vectors
• Compared to Skylake’s 512-bit vectors, and Broadwell’s 256-bit vectors• 8 channels of 2666MHz DDR4
• Compared to 6 channels on Skylake, 4 channels on Broadwell• AMD’s EPYC also has 8 channels
http://gw4.ac.uk/isambard/
Recap of Single Node results from CUG 2018
http://gw4.ac.uk/isambard/
SKL 20c Intel Skylake Gold 6148, $3,078 eachTX2 32c Cavium ThunderX2, $1,795 each (near top-bin)
6 S. McIntosh-Smith et al
Processor Cores Clock TDP FP64 Bandwidthspeed Watts TFLOP/s GB/sGHz
Broadwell 2⇥ 22 2.2 145 1.55 154Skylake Gold 2⇥ 20 2.4 150 3.07 256Skylake Platinum 2⇥ 28 2.1 165 3.76 256ThunderX2 2⇥ 32 2.2 175 1.13 320
TABLE 1 Hardware information (peak �gures)
Cores
TFLOPS/s
L1bandwidth
(agg.TB/s)
L2bandwidth
(agg.TB/s)
L3bandwidth
(agg.GB/s)
Memory b
and-
width (GB/s)
0
0.5
1
1.5
2
2.5
44 1.55 6.31 2.23 726 131.2
56
3.76
11.18
3.57
767.2
214.9
64
1.13
3.46
2.14
537.6
253.4
Relativ
e�g
ures
ofmerit
(normalize
dto
Broa
dwell)
Broadwell 22c Skylake 28c ThunderX2 32c
FIGURE 2 Comparison of properties of Broadwell 22c, Skylake 28c and ThunderX2 32c. Results are normalized to Broadwell.
that achieved the highest performance in each case was used in the results graphs displayed below. Likewise for the Intel processors, we usedGCC 7, Intel 2018, and Cray CCE 8.5–8.7. Table 2 lists the compiler that achieved the highest performance for each benchmark in this study.
4.2 Mini-apps
Figure 3 compares the performance of our target platforms over a range of representative mini-applications.STREAM: The STREAM benchmark measures the sustained memory bandwidth from the main memory. For the processors tested, the available
memory bandwidth is essentially determined by the number of memory controllers. Intel Xeon Broadwell and Skylake processors have four and sixmemory controllers per socket, respectively. The Cavium ThunderX2 processor has eight memory controllers per socket. The results in Figure 3show a clear trend that Skylake achieves a 1.64⇥ improvement over Broadwell, which is to be expected, given Skylake’s faster memory speed
Benchmarking platforms
http://gw4.ac.uk/isambard/
Previous single node performance results
CP2K
GROMACS
NAMD
NEMO
OpenFOAM
OpenSBLI
Unified
Model
VASP
GeometricMean
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
1 1 1 1 1 1 1 1 1
1.29
1.29
0.98
1.44 1.57
1.39
1.06
1.32
1.281.37 1.45
1.21
1.65
1.66
1.72
1.19
1.42
1.45
1.15
0.68
1.16
1.49
1.87
1.69
0.92
0.76
1.15
Perform
ance
(normalized
toBroad
well)
Broadwell 22c Skylake 20c Skylake 28c ThunderX2 32c
https://github.com/UoB-HPC/benchmarks
Scalability comparisons• We’ve plotted results using ‘Scaling (parallel) efficiency’• We’ve compared against two x86-based XC50 systems:
• Horizon using Intel Skylake Gold 6148 20-core CPUs at 2.4GHz• Swan using Intel Skylake Platinum 8176 28-core CPUs at 2.1GHz• Could only go up to 64 nodes on these systems, though we could have
gone up to 164 on Isambard• All the results are for strong scaling, except SNAP• All of these systems use the same interconnect (Aries) and the
same O/S and MPI library, so this is a good test of whether Arm-based ThunderX2 scales as well as x86
http://gw4.ac.uk/isambard/
CloverLeaf scaling – relative performance
http://gw4.ac.uk/isambard/
CloverLeaf scaling – parallel efficiency
http://gw4.ac.uk/isambard/
TeaLeaf scaling – relative performance
http://gw4.ac.uk/isambard/
TeaLeaf scaling – parallel efficiency
http://gw4.ac.uk/isambard/
SNAP scaling – relative performance
http://gw4.ac.uk/isambard/
SNAP scaling – parallel efficiency
http://gw4.ac.uk/isambard/
GROMACS scaling – relative performance
http://gw4.ac.uk/isambard/
GROMACS scaling – parallel efficiency
http://gw4.ac.uk/isambard/
NEMO scaling
http://gw4.ac.uk/isambard/
Parallel efficiency Relative performance
OpenFOAM scaling
http://gw4.ac.uk/isambard/
Parallel efficiency Relative performance
OpenSBLI scaling
http://gw4.ac.uk/isambard/
Parallel efficiency Relative performance
VASP scaling
http://gw4.ac.uk/isambard/
Parallel efficiency Relative performance
Which compilers were best in each case?
http://gw4.ac.uk/isambard/
Isambard scaling summary• Arm-based systems appear to scale just as well as x86 ones• For certain codes that were compute-bound at low scale, these
became network bound at ‘real’ scale, levelling the playing field• We’re seeing a minor issue with scaling in two cases, appears to
be related to MPI collectives – investigations are underway• The software stack has been robust, reliable and high-quality
(both the commercial and open source parts)• Now have evidence that Arm-based systems are real alternatives
for HPC, reintroducing much needed competition to the market
http://gw4.ac.uk/isambard/
Tom Deakin
The Bristol HPC team doing this work
Andrei PoenaruJames Price
Also thanks go to:• The Isambard project members: the GW4 Alliance, the Met Office, Arm, Marvell and Cray• Cray for access to the Swan and Horizon x86 systems• EPSRC for funding the project
For more information
Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard S. McIntosh-Smith, J. Price, T. Deakin and A. Poenaru, CUG 2018, Stockholm
http://uob-hpc.github.io/2018/05/23/CUG18.html
Bristol HPC group: https://uob-hpc.github.io/
Isambard: http://gw4.ac.uk/isambard/
Build and run scripts: https://github.com/UoB-HPC/benchmarks
http://gw4.ac.uk/isambard/
Backup
http://gw4.ac.uk/isambard/
Comparison of compilers
on Arm
http://gw4.ac.uk/isambard/