WHITE PAPER
Intel® True Scale Fabric Architecture
Supercomputing

Intel® True Scale Fabric Architecture: Three Labs, One Conclusion
Intel® True Scale Fabric in National Laboratory Systems Changes the View of Interconnects and the Work in Supercomputing
EXECUTIVE SUMMARY
Three United States national laboratories, known for their work in supercomputing,
recently benchmarked new systems delivered to each of them. These systems are
built on the Intel® Xeon® processor E5-2600 family and Intel® True Scale Fabric, based
on InfiniBand* and the open source Performance Scale Messaging (PSM) interface.
The scientists performing the benchmarks concluded in individual reports that
Intel True Scale Fabric:
· Contributed to “unprecedented scalability and performance” in their systems,
and is allowing them to change how they work.
· Outperforms, on some tests, one of the most powerful customized supercomputers
in the world, ranked 18th on the November 2012 Top500 list.1
· Delivers a level of performance they had not seen from a commodity
interconnect before.
The new systems, named Chama (at Sandia National Laboratories), Luna (at Los Alamos
National Laboratory), and Zin (at Lawrence Livermore National Laboratory), are part
of the Tri-Labs Linux* Capacity Clusters 2 (TLCC2) in the Advanced Simulation and
Computing (ASC) program under the National Nuclear Security Administration (NNSA).
This paper summarizes the findings of the reports from these three laboratories.
TABLE OF CONTENTS
Executive Summary . . . 1
When Fast is Not Fast Enough . . . 2
Intel, a Force in HPC . . . 2
  Key Components to a Fast System . . . 2
Intel, TLCC, and TLCC2 . . . 2
“Unprecedented Scalability and Performance” – Chama at Sandia National Laboratories . . . 3
  Key Findings . . . 4
  Messaging Micro-benchmarks . . . 4
    - Bandwidth and Latency . . . 4
    - MPI Message Rate . . . 5
    - Random Message Bandwidth . . . 5
    - Global Communications – MPI Allreduce . . . 5
  Application Testing . . . 5
    - Cielo Acceptance Benchmarks . . . 7
  Sandia’s Conclusions . . . 7
“Changing the Way We Work” – Luna at Los Alamos National Laboratory . . . 8
  Key Findings . . . 8
  Application Testing . . . 9
  Communications Micro-benchmarks . . . 9
    - Node-to-Node Bandwidth and Adapter Contention . . . 10
    - Global Communications – MPI_Allreduce Results . . . 10
  Los Alamos’ Conclusions . . . 10
Supreme Scalability – Zin at Lawrence Livermore National Laboratory . . . 11
Summary and Conclusions . . . 11
WHEN FAST IS NOT FAST ENOUGH

InfiniBand* Architecture has proven itself over the years as the interconnect technology of choice for high-performance computing (HPC). For a commodity interconnect, it continues to achieve performance advances above other industry-standard networks, outperforming them by a significant factor. But when it comes to the demands of HPC and MPI message passing, fast is never fast enough. While MPI using InfiniBand Verbs delivers fast communications, Verbs and traditional offload processing on InfiniBand Host Channel Adapters (HCAs) carry a costly overhead that hinders scalability at larger core counts.
Intel True Scale Fabric, with its open source Performance Scale Messaging (PSM) interface and onload traffic processing, was designed from the ground up to accelerate MPI messaging specifically for HPC. Intel True Scale Fabric delivers very high message rates, low MPI latency, and high effective application bandwidth, enabling MPI applications to scale to thousands of nodes. This performance drove the choice of interconnect for the most recent acquisitions in the Advanced Simulation and Computing (ASC) Program’s Tri-Labs Linux Capacity Clusters 2 (TLCC2): Chama (at Sandia National Laboratories), Luna (at Los Alamos National Laboratory), and Zin (at Lawrence Livermore National Laboratory).
INTEL, TLCC, AND TLCC2

The ASC Program under the National Nuclear Security Administration (NNSA) provides leading-edge, high-end simulation capabilities to support the Administration’s mission. Some of the fastest supercomputers in the world are managed under the ASC at the three NNSA laboratories: Los Alamos National Laboratory, Sandia National Laboratories, and Lawrence Livermore National Laboratory. These machines include “capacity” and “capability” HPC systems designed for a range of computing jobs and users.
Capacity and capability machines are generally distinguished by their size and users. While both categories have grown in computing abilities over the years, capability systems are typically dedicated to a smaller group of users and are much larger, with roughly an order of magnitude more cores than capacity machines (hundreds of thousands compared to tens of thousands of cores).
The Tri-Lab Linux Capacity Clusters (TLCC) contribute to capacity computing at the three NNSA laboratories. TLCC is designed for scalability, adapting resources to each job’s computing requirements while running multiple jobs simultaneously. The systems consist of a number of Scalable Units (SU), each SU comprising 162 compute, user, and management nodes and 2,592 cores, and delivering about 50 teraFLOPS/SU. One TLCC procurement included the supercomputer Sierra, built with Intel True Scale Fabric components and housed at Lawrence Livermore National Laboratory.
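The SU figures quoted above hang together arithmetically. A quick sketch (our own consistency check, not taken from the report):

```python
# Quick consistency check of the Scalable Unit (SU) figures quoted above.
nodes_per_su = 162
cores_per_su = 2_592
teraflops_per_su = 50  # "about 50 teraFLOPS/SU"

cores_per_node = cores_per_su // nodes_per_su      # 2,592 / 162 = 16
gflops_per_core = teraflops_per_su * 1_000 / cores_per_su

print(cores_per_node)             # 16 cores per node
print(round(gflops_per_core, 1))  # ~19.3 GFLOPS per core implied
```

At 16 cores per node and roughly 19 GFLOPS per core, the quoted ~50 teraFLOPS per SU is plausible.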
The more recent TLCC2 procurement, the second round of scalable Linux clusters, consists of three large Linux clusters, one housed at each NNSA laboratory:
• Chama – 8 SUs, with 1,296 nodes, located at Sandia National Laboratories in Albuquerque, New Mexico
INTEL, A FORCE IN HPC

Intel has a long history in high-performance computing (HPC) systems and the national laboratories that use them. Intel built the first massively parallel processing (MPP) machine to reach one teraFLOP, delivering it in 1996 to the Advanced Simulation and Computing (ASC) Program (formerly ASCI) as Option Red. Intel continues to be a driving force in supercomputing, with Intel® processors in more systems on the Top500 list1 of the world’s fastest supercomputers than any other manufacturer. But it takes more than just a fast processor to live among the fastest 500 systems.
Key Components to a Fast System
The fastest systems use more than just Intel processors. Intel provides the components and software tools to help achieve the highest performing codes on some of the nation’s most critical computing jobs.
• Intel® Xeon® processors – 377 (75 percent) of the Top500 supercomputers use Intel® architecture processors.
• Intel® True Scale Fabric – designed specifically for HPC to minimize communications overhead and enable efficient systems, Intel True Scale Fabric enables the fastest clusters based on InfiniBand* Architecture.
• Intel® Xeon Phi™ coprocessors – built on many-core architecture, Intel Xeon Phi coprocessors offer unparalleled acceleration for certain codes.
• Intel® Software Tools – a host of tools support cluster builders and application programmers to make their codes fast and efficient.
• Intel® Storage Systems – HPC demands the fastest components, and Intel storage components deliver both speed and reliability.
• Luna – 10 SUs, with 1,620 nodes, located at Los Alamos National Laboratory in Los Alamos, New Mexico
• Zin – 18 SUs with 2,916 nodes, located at Lawrence Livermore National Laboratory in Livermore, California
All three machines are built around Intel® technologies, including Intel® Xeon® processors and Intel True Scale Fabric HCAs and switches. At all three laboratories, users and laboratory scientists have reported significant performance and scalability improvements over other machines, triggering scientists to take a new look at how their work gets done.
“UNPRECEDENTED SCALABILITY AND PERFORMANCE” – CHAMA AT SANDIA NATIONAL LABORATORIES

Sandia National Laboratories, headquartered in Albuquerque, New Mexico, has, over the last six decades, “delivered essential science and technology to resolve the nation’s most challenging security issues.”2 Sandia has a long history of high-performance computing. It is the home of the nation’s first teraFLOP supercomputer, ASCI Option Red, built by Intel in 1996. As one of the laboratories providing capacity computing to the NNSA ASC program, it received its latest TLCC2 capacity machine, Chama, in 2012.
With the acquisition of Chama, users began reporting 2x to 5x performance improvement on their jobs. Sandia scientists wanted to “understand the characteristics of this new resource.” So, they performed micro-benchmarks and application program testing on Chama and two other systems at Sandia: Red Sky, another capacity computing machine and predecessor to Chama in the TLCC, and Cielo, a capability supercomputer. Their findings are captured in their report.3
Table 1 lists the system configurations for Chama, Red Sky, and Cielo.
CONFIGURATION             CHAMA                       RED SKY                     CIELO
COMPUTE COMPLEX
Total Computing Nodes     1,232                       2,816                       8,894
Processor Architecture    Intel® Architecture         Intel® Architecture         AMD Magny-Cours*
                          (formerly codenamed         (formerly codenamed
                          Sandy Bridge)               Nehalem)
Cache L1 / L2 / L3        8 x 32 KB / 8 x 256 KB      4 x 32 KB / 4 x 256 KB      8 x 64 KB / 8 x 512 KB
                          / 20 MB                     / 8 MB                      / 10 MB
Cores/Node                16                          8                           16
Total Cores               19,712                      22,528                      142,304
Clock Speed (GHz)         2.60                        2.93                        2.40
Instruction Set           Intel® AVX                  SSE4.2                      SSE4a
Architecture (ISA)
Memory                    DDR3 1600 MHz               DDR3 1333 MHz               DDR3 1333 MHz
Memory/Core (GB)          2                           1.5                         2
Channels/Socket           4                           3                           4
Peak Node GFLOPS          332.8                       94.76                       153.6
INTERCONNECT
Manufacturer              Intel (QLogic)              Mellanox*                   Gemini*
Technology/Rate           InfiniBand* QDR             InfiniBand QDR              Custom
IB HW Interface           PSM                         Verbs                       Custom
Topology                  Fat Tree                    3D Torus: 6 x 6 x 8         3D Torus: 18 x 12 x 24

Table 1. Sandia National Laboratories Test Systems.
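The peak-GFLOPS row in Table 1 can be reproduced from cores, clock, and SIMD width. In this sketch, the 8- and 4-FLOPs-per-cycle figures are standard per-core widths for AVX and SSE, assumed here rather than taken from the report:

```python
# Peak node GFLOPS = cores/node x clock (GHz) x double-precision FLOPs/cycle.
# FLOPs/cycle: 8 with AVX (Sandy Bridge), 4 with SSE (Nehalem, Magny-Cours).
def peak_node_gflops(cores, ghz, flops_per_cycle):
    return round(cores * ghz * flops_per_cycle, 2)

print(peak_node_gflops(16, 2.60, 8))  # Chama: 332.8, matching Table 1
print(peak_node_gflops(16, 2.40, 4))  # Cielo: 153.6, matching Table 1
print(peak_node_gflops(8, 2.93, 4))   # Red Sky: 93.76 (Table 1 lists 94.76)
```

The small gap on Red Sky (93.76 computed vs. 94.76 listed) suggests either a rounding difference or a typo in the extracted table; we leave the listed value as-is.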
Key Findings
Sandia scientists tested the systems across a range of characteristics beyond those impacted by the interconnect, including memory performance and contention, processor performance, and more. Chama proved to be a well-balanced system with impressive performance results that outperformed Red Sky and compared well against Cielo. This paper, however, focuses on the interconnect benchmarks and application testing to understand how the interconnect contributes to overall HPC performance. The tests revealed the following about the Intel True Scale Fabric interconnect:
• Chama returned unprecedented results in MPI messaging rate at message sizes up to 1 KB, outperforming even Cielo’s custom interconnect.
• Chama delivered random messaging bandwidth the scientists had not yet seen from a commodity interconnect, exceeding Cielo by as much as 30 percent.
• Collectives performance scaling for Chama compares well against the custom interconnect of Cielo, both outperforming Red Sky by an order of magnitude.
• Chama scaled well against Cielo on three Sandia finite element production applications, which revealed severe scaling limitations on Red Sky.
The key findings from these micro-benchmarks and application tests indicate that Chama, with its Intel True Scale Fabric, “has a strong impact on applications,” as attested by Chama users.
Messaging Micro-benchmarks
While standard metrics include inter-node latency and bandwidth, Sandia scientists were keenly interested in Chama’s MPI messaging rate and scalable random message bandwidth. Figure 1 shows the benchmark results for these tests.
Bandwidth and Latency

Sandia codes are more sensitive to bandwidth than to latency; this sensitivity drove the choice of Chama’s Intel True Scale Fabric interconnect. As shown in Figures 1a and 1b, Chama performed well compared to Cielo’s custom Gemini* interconnect, according to Sandia scientists. We note that at sizes well within the typical HPC message-size space, Red Sky’s bandwidth climbed much more slowly, remaining about half of Chama’s, and its latency began to increase dramatically at just 64-byte messages.
Figure 1. Sandia Inter-node MPI Performance: A. Bandwidth; B. Latency; C. Message Rate; D. Random Messaging Bandwidth.
MPI Message Rate

Of particular interest to the testers at Sandia was the ability of the interconnect to process messages as core counts increased. HCA congestion on multi-core nodes is “becoming a significant constraint” in HPC with commodity interconnects, even those based on InfiniBand Architecture. “Therefore, the most important internode behavior for Chama is the significant gain in MPI message rate in comparison to Cielo.”

For message sizes up to 1 KB, the Intel True Scale Fabric outperformed the custom interconnect of Cielo by 2x to 4x. For Sandia, this was an unprecedented event, which “…can have a significant positive impact on many applications, such as those that employ a sparse solver kernel.”
Random Message Bandwidth

Not all inter-node communications are structured. Indeed, many applications, such as Charon, induce unstructured communications across the fabric. Measuring random message traffic therefore helps predict system performance with such codes. Sandia uses a random messaging benchmark for understanding scalability in commodity clusters. The test “sends thousands of small messages from all MPI tasks with varying message sizes (100 bytes to 1 KB) to random MPI rank destinations.” An aggregate average random messaging bandwidth (Figure 1d) was derived from per-process measurements.
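As an illustration only, not Sandia's actual benchmark code, the traffic pattern described above (small messages of 100 bytes to 1 KB sent to uniformly random ranks, never to oneself) can be sketched as:

```python
import random

# Sketch of the random-messaging traffic pattern: each MPI task sends
# many small messages (100 bytes to 1 KB) to random destination ranks.
# Only the destination/size schedule is modeled here, not MPI transfers.
def random_message_schedule(rank, n_ranks, n_messages, seed=0):
    rng = random.Random(seed + rank)
    schedule = []
    for _ in range(n_messages):
        dest = rng.randrange(n_ranks - 1)  # pick any rank except ourselves
        if dest >= rank:
            dest += 1
        size = rng.randrange(100, 1025)    # 100 bytes to 1 KB inclusive
        schedule.append((dest, size))
    return schedule

sched = random_message_schedule(rank=3, n_ranks=32, n_messages=1000)
assert all(dest != 3 for dest, _ in sched)
assert all(100 <= size <= 1024 for _, size in sched)
```

In a real MPI benchmark, each rank would post these sends and the aggregate average bandwidth would be derived from per-process timings, as the report describes.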
The measurements showed the following results, which the scientists had never seen with commodity interconnects benchmarked against a custom architecture:
• Red Sky, compared to Chama, performed from 10x slower (32 cores) to 220x slower (8,192 cores)
• Chama was 20 to 30 percent faster than Cielo, the capability supercomputer
Chama’s Intel True Scale Fabric scales extremely well with applications that create random traffic on large systems.
Global Communications – MPI Allreduce

To understand the behavior of Chama with applications that are sensitive to collective operations, Sandia averaged scalability performance data from a thousand trials using 8-, 64-, and 1,024-byte transfers. As shown in Figure 2, Chama performs competitively with Cielo across all ranks. Both perform an order of magnitude better than Red Sky in some cases, with Red Sky’s performance falling off above 1 KB messages.
Application Testing
For Red Sky, the benchmarks above echo the findings of previous studies Sandia performed on commodity clusters: poor scalability with applications that use implicit solvers, and poor parallel efficiency with higher amounts of unstructured message traffic (characteristics Chama did not exhibit in the micro-benchmarks).

These results, along with other discoveries on previous commodity clusters, had provided a “strong case” for Sandia to invest in more custom MPP machines. However, users of Chama have reported performance improvements of 2x to 5x with their application codes. To further understand these experiences, scientists proceeded with application testing.
Figure 2. IMB MPI_Allreduce Performance: A. 8 bytes; B. 64 bytes; C. 1024 bytes (average time in microseconds vs. MPI ranks; lower is better).
Table 2 lists four of the five applications used, along with their results, to help reveal how Chama compared to Red Sky and Cielo at scale. These results are consistent with users’ experiences.
Figure 3 graphs the results for the Finite Element Method tests Aleph, Aria, and Charon; Figure 4 shows the performance for AMG2006. Again, Red Sky exhibits severe scaling limitations, while Chama outperforms Cielo on all tests.
APPLICATION  SCIENCE DOMAIN          KEY ALGORITHM                  TIMING METRIC           MAX. IMPROVEMENT AT SCALE
                                                                                            CHAMA:RED SKY  CHAMA:CIELO
Aleph        Plasma simulation       Finite element method (FEM)    Weak scaling, fixed     4.2x           1.3x
                                     particle move + field solves   number of steps
AMG2006      Algebraic multigrid     Laplace solver, preconditioned Weak scaling,           1.5x           1.75x
                                     conjugate gradient             100 iterations
Aria         CFD, thermodynamics     Implicit FEM                   Strong scaling,         3.4x           2.6x
                                                                    25 time steps
Charon       Semiconductor device    Implicit FEM                   Weak scaling, fixed     2.5x           1.6x
             simulation                                             number of iterations

Table 2. Sandia Application Scaling Tests.
Figure 3. Charon, Aleph, and Aria Application Scaling: A. Charon; B. Aleph; C. Sierra/Aria (time vs. MPI ranks; lower is better).

Figure 4. AMG2006 Scaling Comparisons (PCG solve time in seconds vs. MPI ranks; lower is better).
Cielo Acceptance Benchmarks

A number of other applications were benchmarked on Chama but are not covered by the current Sandia report. However, results of four of the six Tri-Lab Cielo acceptance benchmarks were included; they are shown in Figure 5. Sandia scientists considered these results good, if “not as spectacular” as the earlier tests.
Sandia’s Conclusions
Sandia scientists stated the results for Chama’s Intel True Scale Fabric performance to be “unprecedented” and “never before seen” for a commodity interconnect. With its onload processing and its PSM interface, Chama’s Intel True Scale Fabric outperformed Red Sky’s Verbs-based InfiniBand communications and was competitive with the capability supercomputer Cielo. MPI profiles revealed that the faster MPI processing of Chama’s Intel True Scale Fabric contributed to its scalability and to the 2x to 5x performance improvement experienced by Chama’s users.
Figure 5. Cielo Acceptance Test Performance Summary: performance gain factors for Chama over Red Sky and for Chama over Cielo on SAGE, AMG2006, UMT, HPCCG, and CTH at 16, 128, and 1,024 MPI tasks.
“The performance gains that Sandia users are experiencing with their applications on…Chama, has resulted in many positive feedbacks from happy users. …we are seeing unprecedented performance and scalability of many key Sandia applications.”
CATEGORY     PARAMETER             TYPHOON                          LUNA
CPU core     Make/model            AMD Magny-Cours* Opteron* 6128   Intel® Xeon® E5-2670 (Sandy Bridge)
             Clock speed           2.0 GHz                          2.6 GHz
             L1 data cache size    64 KB                            32 KB
             L2 cache size         0.5 MB                           0.25 MB
CPU socket   Cores                 8                                8
             Shared L3 cache size  12 MB                            20 MB
             Memory controllers    4 x DDR3-1333                    4 x DDR3-1600
Node         Sockets               4                                2
             Memory capacity       64 GB                            32 GB
Network      Make                  Mellanox*                        Intel® True Scale Fabric
             Type                  QDR InfiniBand (Verbs)           QDR InfiniBand (PSM)
System       Integrator            Appro                            Appro
             Compute nodes         416                              1,540
             I/O nodes             12                               60
             Installation date     March 2011                       April 2012
“CHANGING THE WAY WE WORK” – LUNA AT LOS ALAMOS NATIONAL LABORATORY

Los Alamos National Laboratory has a nearly 70-year history of discovery and innovation in science and technology. Its mission is to “develop and apply science and technology to ensure the safety, security, and reliability of the U.S. nuclear deterrent; to reduce global threats; and solve other emerging national security and energy challenges.”4
In 2012, it acquired Luna as part of TLCC2, and “reports from users have been extremely positive.” In particular, two directed stockpile work (DSW) problems completed by users Mercer-Smith and Scott ran 3.9x and 4.7x faster on Luna than on other systems. Scientists at Los Alamos were asked to understand why Luna performed so much better. Their research is captured in benchmarks and application testing between Luna and Typhoon.5
Table 3 lists the configurations of the two systems used in the evaluation.
Key Findings
Los Alamos scientists performed application tests to compare performance and scalability, plus micro-benchmarks to help understand what makes the systems perform differently. As at Sandia, the tests were comprehensive across a variety of characteristics; however, this paper focuses on the results of interconnect micro-benchmarks and application testing.
We note that the authors discovered Typhoon exhibited atypical InfiniBand bandwidth performance during the single-node communication micro-benchmark. This led to a later evaluation of Typhoon’s InfiniBand performance and an ensuing report.6 The findings revealed that a configuration problem caused lower than expected InfiniBand performance on Typhoon. When it was corrected and the application xRAGE used in the current tests was rerun, Typhoon improved by about 21 percent on xRAGE. Whether this handicap carried across to all Typhoon tests is unclear. Thus, in this paper, where appropriate, we award Typhoon a 21 percent benefit and present the resulting values in parentheses next to the original report’s results. Nonetheless, Luna generally outperformed Typhoon on every test and micro-benchmark Los Alamos performed, with some variability.
The Los Alamos tests revealed the following:
• Across several comparisons, Luna rates from 1.2x to 4.7x faster than Typhoon.
• Luna’s interconnect supports nearly full InfiniBand QDR bandwidth with little to no contention scaling to 16 cores, while Typhoon starts out fast and degrades steadily to 32 cores without achieving nearly full InfiniBand speeds.
• At 16 cores, Luna’s Intel True Scale Fabric is 2.10x (1.74x) faster than Typhoon; at 32 cores, the difference rises to 2.19x (1.81x) faster.
• Collectives performance showed Luna with an average of 1.95x (1.61x) improvement over Typhoon, but with variability.

The key findings from these micro-benchmarks and application tests indicate that Luna, with its Intel True Scale Fabric, delivers a wide range of performance improvements over Typhoon.
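The parenthetical values in these findings follow from the 21 percent correction described earlier; our reading is that each measured Luna:Typhoon ratio is divided by 1.21:

```python
# Applying the 21 percent Typhoon correction: divide each measured
# Luna:Typhoon ratio by 1.21 (our interpretation of the adjustment).
def corrected(ratio, typhoon_gain=0.21):
    return round(ratio / (1 + typhoon_gain), 2)

print(corrected(2.10))  # 1.74, as quoted at 16 cores
print(corrected(2.19))  # 1.81, as quoted at 32 cores
print(corrected(1.95))  # 1.61, the collectives average
```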
Table 3. Los Alamos National Laboratory Test Systems.
“Luna is the best machine that the laboratory has ever had.”
APPLICATION/SOURCE                   LUNA:TYPHOON   DESCRIPTION
Theoretical peak memory BW           1.2x           The simple ratio of Luna’s memory bandwidth to Typhoon’s
xRAGE                                1.56x (1.29x)  A collectives-heavy code
EAP test suite                       1.69x (1.40x)  A collection of 332 regression tests from the Eulerian
                                                    Applications Project (EAP), run nightly on Luna and Typhoon
Mizzen problem                       2.07x (1.71x)  An integrated code representative of the types of codes
                                                    normally run on Luna and Typhoon
Theoretical compute rate             2.6x           Calculated maximum theoretical FLOPS
High-Performance Linpack* benchmark  2.72x          According to the June 2012 Top500 list
PARTISN, sn timing                   2.75x (2.28x)  A more communications-active code than xRAGE, with many
                                                    small message exchanges
ASC1 code (Mercer-Smith & Scott)     3.9x           DSW problem; not part of the current testing
ASC2 code (Mercer-Smith & Scott)     4.7x           DSW problem; not part of the current testing
Application Testing
Los Alamos scientists performed four application tests, varying the number of cores and nodes for different tests. They sought to thoroughly understand what drives Luna’s significant improvements and attempted to repeat the improvements Mercer-Smith and Scott experienced. The tests and the sources of other metrics are briefly described in Table 4, along with the results.

The full extent of their comprehensive testing is beyond the scope of this paper; the results are summarized in Figure 6.

Using theoretical calculations, actual measurements, and the experiences reported by users, Luna averages about 2.5x faster than Typhoon.
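The two “theoretical” rows in Table 4 and the roughly 2.5x average can be reproduced with simple arithmetic. In this sketch, the DDR3 speeds and clock rates come from Table 3, while the 8-versus-4 FLOPs-per-cycle SIMD widths for AVX and SSE are our assumption, not figures from the report:

```python
# Deriving the "theoretical" ratios in Table 4 and the overall average.
mem_bw_ratio = 1600 / 1333             # 4 channels of DDR3-1600 vs. DDR3-1333
compute_ratio = (2.6 * 8) / (2.0 * 4)  # per-core GFLOPS: Luna vs. Typhoon

ratios = [1.2, 1.56, 1.69, 2.07, 2.6, 2.72, 2.75, 3.9, 4.7]  # Table 4 column
average = sum(ratios) / len(ratios)

print(round(mem_bw_ratio, 2))   # 1.2, the "theoretical peak memory BW" row
print(round(compute_ratio, 1))  # 2.6, the "theoretical compute rate" row
print(round(average, 2))        # 2.58, consistent with "about 2.5x faster"
```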
Communications Micro-benchmarks
As with Sandia, Los Alamos scientists ran several micro-benchmarks to isolate some of the causes of Luna’s performance edge over Typhoon. The Los Alamos tests also isolated several improvements at the node and processor architectural levels. But, again, this paper focuses on the interconnect’s contribution to overall performance.
Table 4. Application Test Descriptions.
Figure 6. Luna:Typhoon Applications Performance Summary (performance ratios at 128 MPI ranks): theoretical peak memory bandwidth, 1.2; xRAGE asteroid problem (Pakin & Lang), 1.56; EAP test suite geometric mean (EAP team), 1.69; integrated code, mizzen problem (Brown), 2.07; theoretical peak compute rate, 2.6; High-Performance Linpack benchmark, 2.72; PARTISN sn timing (Pakin & Lang), 2.76; ASC1 code/problem (Mercer-Smith & Scott), 3.9; ASC2 code/problem (Mercer-Smith & Scott), 4.7.
Node-to-Node Bandwidth and Adapter Contention

This micro-benchmark exchanges a large volume of data between two nodes, starting with a single core on each node and scaling to all 16 cores on a node (for Luna) or 32 cores (for Typhoon). The test records the bandwidth consumed for each exchange. Figure 7 charts the results.

For Luna, the first exchanges do not saturate the network, but within four cores, full speed is achieved at 3,151 B/μs and held across all 16 cores, with little measurable degradation from contention. (The same behavior was seen at Sandia, where the messaging rate scaled well across many MPI ranks.) Typhoon’s network, while starting out faster than Luna at 1,879 B/μs, degraded steadily to 1,433 B/μs as core count increased, indicative of contention as the adapter tries to pass traffic from more cores.

The scientists determined, “while Luna’s per-core (aggregate divided by number of cores) communication bandwidth is 2.10x [(1.74x)]7 that of Typhoon’s at 16 cores/node, this ratio increases to 2.19x [(1.81x)] when comparing a full Luna node to a full Typhoon node.”

Global Communications – MPI_Allreduce Results

For collectives performance, the Los Alamos authors created a micro-benchmark that reports the average time per MPI_Allreduce operation for various message sizes across 128 MPI ranks. Figure 8 graphs the ratio of Typhoon’s performance to Luna’s. The authors note “…the geometric mean of the measurements indicate that Typhoon takes an average (horizontal line) of 1.95x (1.61x) as long as Luna to perform an MPI_Allreduce…” They are also drawn to the variability of the results, concluding, as with other results in their study, that “there is a large set of corner cases where Luna can be many times faster than Typhoon—and some applications may in fact hit these cases—but more modest speedups are the more common case.”

Los Alamos’ Conclusions

Luna outperforms Typhoon by 1.2x to 4.7x, as indicated by both theoretical and actual results. The authors conclude that “…almost all key components of Luna—CPUs, memory and network—are faster than their Typhoon counterparts, but by widely varying amounts (and in nonlinear patterns) based on how these hardware resources are utilized.”

Indeed, one set of users considers Luna the best machine the Laboratory has ever had. Other user experiences are so positive that Luna is changing how some work is planned going forward.
Figure 7. Network Bandwidth as a Function of Contention for the NIC (aggregate communication bandwidth in B/μs vs. number of communicating pairs of processes; series: Typhoon, Luna, and theoretical peak).

Figure 8. Ratio of Luna’s MPI_Allreduce Latency to Typhoon’s for 128 MPI Ranks (Typhoon:Luna latency vs. message size).
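The quoted full-node ratio can be checked against the Figure 7 endpoints, and the figure's theoretical-peak line against general QDR InfiniBand math (the 8b/10b encoding arithmetic is standard QDR background, not a figure from the report):

```python
# Full Luna node vs. full Typhoon node, from the Figure 7 endpoints.
luna_full_node = 3151     # B/us, held across all 16 Luna cores
typhoon_full_node = 1433  # B/us, after degrading across 32 Typhoon cores
print(round(luna_full_node / typhoon_full_node, 2))  # 2.2, i.e. the ~2.19x quoted

# QDR InfiniBand: 4 lanes x 10 Gb/s with 8b/10b encoding -> 32 Gb/s of data,
# which is 4 GB/s, or 4,000 B/us: the theoretical-peak line in Figure 7.
qdr_peak_bytes_per_us = 4 * 10e9 * 8 / 10 / 8 / 1e6
print(qdr_peak_bytes_per_us)  # 4000.0
```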
“Luna tends to be about twice as fast as Typhoon across the various micro-benchmarks, but there are many outliers.”
SUPREME SCALABILITY – ZIN AT LAWRENCE LIVERMORE NATIONAL LABORATORY

Beginning operations in 1952, Lawrence Livermore National Laboratory has grown into a diverse complex of science, research, and technology, part of which supports the ASC Program and missions of the NNSA. The Terascale Simulation Facility (TSF) at Lawrence Livermore National Laboratory houses TLCC2 clusters and includes the world’s second fastest supercomputer, Sequoia, according to the Top500 list.1

In 2011, the Laboratory acquired Zin, the latest addition to its TLCC2. The Zin cluster comprises 2,916 nodes, 46,208 cores, and an Intel True Scale Fabric network. Soon after its delivery in 2011, Zin was ranked number 15 on the Top500 list of the fastest supercomputers in the world. A year later, it remains among the 30 fastest systems.
In 2012, Lawrence Livermore scientists ran scalability benchmarks across Zin and several other systems in the Tri-Labs complex, including other TLCC units and capability machines, such as Cielo at Sandia National Laboratories. The results were presented at SC12 in November. Figure 9 graphs the results.
Of the six systems in the comparison, Cielo, Purple, and Dawn are capability MPP machines, while Sierra, Muir, and Zin are capacity clusters—all three using Intel True Scale Fabric networks.
In this graph, the lower and flatter the scalability line, the better; a slope of 0 indicates ideal scalability. The three most scalable systems (Sierra, Muir, and Zin) were interconnected with Intel True Scale Fabric components. Zin outperforms the other two systems built on custom interconnects. We note that Cielo is the capability supercomputer at Sandia against which Chama competed so well.
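The slopes reported in Figure 9 can be made concrete by multiplying each slope by a processor count, giving the extra grind time accumulated at that scale (the extrapolation is ours; the slope values are from the figure):

```python
# Extra microseconds per zone-iteration accumulated when scaling from
# 0 to 10,000 processors, at the slopes reported in Figure 9.
slopes = {
    "Purple": 0.000079,
    "Dawn (BG/P)": 0.000016,
    "Zin": 0.000012,
    "Cielo": 0.000010,
    "Sierra": 0.000008,
    "Muir": 0.000005,
}
procs = 10_000
for machine, slope in sorted(slopes.items(), key=lambda kv: kv[1]):
    print(f"{machine}: +{slope * procs:.2f} us/zone-iteration at {procs:,} processors")
```

By this measure, Purple's grind time grows more than six times faster with scale than Zin's.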
SUMMARY And ConClUSIonSAcross three NNSA national laboratories, TLCC and TLCC2 capacity computing systems powered by Intel True Scale Fabric networks and Intel Xeon processor E5-2600 outperform other machines, including MPP capability supercomput-ers. At Sandia National Laboratories, Chama delivers “unprecedented scalability and performance.” Users of Luna at Los Alamos National Laboratory claim it is the “best machine the Laboratory has ever had,” and it is “changing the way we work.”
Zin at Lawrence Livermore National Labo-ratory, along with two other TLCC clusters built with Intel True Scale Fabric, dominate the scalability testing results of the most scalable systems in the benchmark.
These tests reveal how Intel True Scale Fabric with PSM and onload processing outperforms other interconnects used in HPC and drives some of the fastest supercomputers in the world.
Figure 9. Scaling Results of Zin and Other Tri-Labs Machines.
Weak Scaling – 3D Radiation Problem’s Average Zone-Iteration Grind Time per Machine.
Y-axis: microseconds per zone-iteration; X-axis: processors (CPUs). Lower and flatter is better; a slope of 0 is ideal scaling.
Slopes: Purple – 0.000079; Dawn (BG/P) – 0.000016; Zin – 0.000012; Cielo – 0.000010; Sierra – 0.000008; Muir – 0.000005.
Systems with Intel True Scale Fabric: Muir (full QDR), Sierra (full QDR), and Zin (full, 16 MPI/node). Also shown: Cielo (PGI full, 16 MPI/node), Dawn 2.2, and Purple at retirement (NewComm).
Source: Lawrence Livermore National Laboratory. Intel does not control or audit the design or implementation of third-party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. See the Top500 list for configuration information at http://top500.org.
Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel’s Web site at www.intel.com.
1 www.top500.org/list/2012/11
2 www.sandia.gov/about/index.html
3 Rajan, M., D.W. Doerfler, P.T. Lin, S.D. Hammond, R.F. Barrett, and C.T. Vaughan. “Unprecedented Scalability and Performance of the New NNSA Tri-Lab Linux Capacity Cluster 2,” Sandia National Laboratories.
4 www.lanl.gov/mission/index.php
5 Pakin, Scott and Michael Lang. “Performance Comparison of Luna and Typhoon,” Los Alamos National Laboratory High-Performance Computing Division, November 19, 2012.
6 Coulter, Susan and Daryl W. Grunau. “Typhoon IB Performance,” Los Alamos National Laboratory, March 8, 2013.
7 Bracketed values are added by Intel to offset the report results as described earlier.
Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Printed in USA 0513/ML/HBD/PDF Please Recycle 328985-001US