WHITE PAPER
Intel® True Scale Fabric Architecture
Supercomputing

Intel® True Scale Fabric Architecture: Three Labs, One Conclusion
Intel® True Scale Fabric in National Laboratory Systems Changes the View of Interconnects and the Work in Supercomputing
EXECUTIVE SUMMARY
Three United States national laboratories, known for their work in supercomputing,
recently benchmarked new systems delivered to each of them. These systems are
built on the Intel® Xeon® processor E5-2600 family and Intel® True Scale Fabric, based
on InfiniBand* and the open source Performance Scale Messaging (PSM) interface.
The scientists performing the benchmarks concluded in individual reports that
Intel True Scale Fabric:
· Contributed to “unprecedented scalability and performance” in their systems,
and is allowing them to change how they work.
· Outperforms, on some tests, one of the most powerful customized supercomputers
in the world, ranked 18th on the November 2012 Top500 list.1
· Delivers a level of performance they had not seen from a commodity
interconnect before.
The new systems, named Chama (at Sandia National Laboratories), Luna (at Los Alamos
National Laboratory), and Zin (at Lawrence Livermore National Laboratory), are part
of the Tri-Labs Linux* Capacity Clusters 2 (TLCC2) in the Advanced Simulation and
Computing (ASC) program under the National Nuclear Security Administration (NNSA).
This paper summarizes the findings of the reports from these three laboratories.
TABLE OF CONTENTS
Executive Summary . . . 1
When Fast is Not Fast Enough . . . 2
Intel, a Force in HPC . . . 2
  Key Components to a Fast System . . . 2
Intel, TLCC, and TLCC2 . . . 2
“Unprecedented Scalability and Performance” – Chama at Sandia National Laboratories . . . 3
  Key Findings . . . 4
  Messaging Micro-benchmarks . . . 4
    - Bandwidth and Latency . . . 4
    - MPI Message Rate . . . 5
    - Random Message Bandwidth . . . 5
    - Global Communications – MPI Allreduce . . . 5
  Application Testing . . . 5
    - Cielo Acceptance Benchmarks . . . 7
  Sandia’s Conclusions . . . 7
“Changing the Way We Work” – Luna at Los Alamos National Laboratory . . . 8
  Key Findings . . . 8
  Application Testing . . . 9
  Communications Micro-benchmarks . . . 9
    - Node-to-Node Bandwidth and Adapter Contention . . . 10
    - Global Communications – MPI_Allreduce Results . . . 10
  Los Alamos’ Conclusions . . . 10
Supreme Scalability – Zin at Lawrence Livermore National Laboratory . . . 11
Summary and Conclusions . . . 11
WHEN FAST IS NOT FAST ENOUGH

InfiniBand* Architecture has proven itself over the years as the interconnect technology of choice for high-performance computing (HPC). For a commodity interconnect, it continues to achieve performance advances above other industry-standard networks, outperforming them by a significant factor. But when it comes to the demands of HPC and MPI message passing, fast is never fast enough. While MPI using InfiniBand Verbs delivers fast communications, Verbs and traditional offload processing on InfiniBand Host Channel Adapters (HCAs) carry a costly overhead that hinders scalability at larger core counts.
Intel True Scale Fabric, with its open source Performance Scale Messaging (PSM) interface and onload traffic processing, was designed from the ground up to accelerate MPI messaging specifically for HPC. Intel True Scale Fabric delivers very high message rates, low MPI latency, and high effective application bandwidth, enabling MPI applications to scale to thousands of nodes. This performance drove the choice of interconnect for the most recent acquisitions in the Advanced Simulation and Computing (ASC) Program’s Tri-Labs Linux Capacity Clusters 2 (TLCC2): Chama (at Sandia National Laboratories), Luna (at Los Alamos National Laboratory), and Zin (at Lawrence Livermore National Laboratory).
INTEL, TLCC, AND TLCC2

The ASC Program under the National Nuclear Security Administration (NNSA) provides leading-edge, high-end simulation capabilities to support the Administration’s mission. Some of the fastest supercomputers in the world are managed under the ASC at the three NNSA laboratories: Los Alamos National Laboratory, Sandia National Laboratories, and Lawrence Livermore National Laboratory. These machines include “capacity” and “capability” HPC systems designed for a range of computing jobs and users.
Capacity and capability machines are generally distinguished by their size and users. While both categories have grown in computing abilities over the years, capability systems are typically dedicated to a smaller group of users and are much larger, with roughly an order of magnitude more cores than capacity machines (hundreds of thousands compared to tens of thousands of cores).
The Tri-Lab Linux Capacity Clusters (TLCC) contribute to capacity computing at the three NNSA laboratories. TLCC is designed for scalability, adapting resources to each job’s computing requirements while running multiple jobs simultaneously. The systems consist of a number of Scalable Units (SU), each SU comprising 162 compute, user, and management nodes and 2,592 cores, and delivering about 50 teraFLOPS/SU. One TLCC procurement included the supercomputer Sierra, built with Intel True Scale Fabric components and housed at Lawrence Livermore National Laboratory.
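The SU figures quoted above hang together arithmetically. A quick sketch (our own consistency check, not taken from the report):

```python
# Quick consistency check of the Scalable Unit (SU) figures quoted above.
nodes_per_su = 162
cores_per_su = 2_592
teraflops_per_su = 50  # "about 50 teraFLOPS/SU"

cores_per_node = cores_per_su // nodes_per_su      # 2,592 / 162 = 16
gflops_per_core = teraflops_per_su * 1_000 / cores_per_su

print(cores_per_node)             # 16 cores per node
print(round(gflops_per_core, 1))  # ~19.3 GFLOPS per core implied
```

At 16 cores per node and roughly 19 GFLOPS per core, the quoted ~50 teraFLOPS per SU is plausible.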
The more recent TLCC2 procurement, the second round of scalable Linux clusters, consists of three large Linux clusters, one housed at each NNSA laboratory:
• Chama – 8 SUs, with 1,296 nodes, located at Sandia National Laboratories in Albuquerque, New Mexico
INTEL, A FORCE IN HPC

Intel has a long history in high-performance computing (HPC) systems and the national laboratories that use them. Intel built the first massively parallel processing (MPP) machine to reach one teraFLOP, delivering it in 1996 to the Advanced Simulation and Computing (ASC) Program (formerly ASCI) as Option Red. Intel continues to be a driving force in supercomputing, with Intel® processors in more systems on the Top500 list1 of the world’s fastest supercomputers than any other manufacturer. But it takes more than just a fast processor to live among the fastest 500 systems.
Key Components to a Fast System
The fastest systems use more than just Intel processors. Intel provides the components and software tools to help achieve the highest performing codes on some of the nation’s most critical computing jobs.
• Intel® Xeon® processors – 377 (75 percent) of the Top500 supercomputers use Intel® architecture processors.
• Intel® True Scale Fabric – designed specifically for HPC to minimize communications overhead and enable efficient systems, Intel True Scale Fabric enables the fastest clusters based on InfiniBand* Architecture.
• Intel® Xeon Phi™ coprocessors – built on many-core architecture, Intel Xeon Phi coprocessors offer unparalleled acceleration for certain codes.
• Intel® Software Tools – a host of tools support cluster builders and application programmers to make their codes fast and efficient.
• Intel® Storage Systems – HPC demands the fastest components, and Intel storage components deliver both speed and reliability.
• Luna – 10 SUs, with 1,620 nodes, located at Los Alamos National Laboratory in Los Alamos, New Mexico
• Zin – 18 SUs with 2,916 nodes, located at Lawrence Livermore National Laboratory in Livermore, California
All three machines are built around Intel® technologies, including Intel® Xeon® processors and Intel True Scale Fabric HCAs and switches. At all three laboratories, users and laboratory scientists have reported significant performance and scalability improvements over other machines, triggering scientists to take a new look at how their work gets done.
“UNPRECEDENTED SCALABILITY AND PERFORMANCE” – CHAMA AT SANDIA NATIONAL LABORATORIES

Sandia National Laboratories, headquartered in Albuquerque, New Mexico, has, over the last six decades, “delivered essential science and technology to resolve the nation’s most challenging security issues.”2 Sandia has a long history of high-performance computing. It is the home of the nation’s first teraFLOP supercomputer, ASCI Option Red, built by Intel in 1996. As one of the laboratories providing capacity computing to the NNSA ASC program, it received its latest TLCC2 capacity machine, Chama, in 2012.
With the acquisition of Chama, users began reporting 2x to 5x performance improvement on their jobs. Sandia scientists wanted to “understand the characteristics of this new resource.” So, they performed micro-benchmarks and application program testing on Chama and two other systems at Sandia: Red Sky, another capacity computing machine and predecessor to Chama in the TLCC, and Cielo, a capability supercomputer. Their findings are captured in their report.3
Table 1 lists the system configurations for Chama, Red Sky, and Cielo.
CONFIGURATION             CHAMA                       RED SKY                     CIELO
COMPUTE COMPLEX
Total Computing Nodes     1,232                       2,816                       8,894
Processor Architecture    Intel® Architecture         Intel® Architecture         AMD Magny-Cours*
                          (formerly codenamed         (formerly codenamed
                          Sandy Bridge)               Nehalem)
Cache L1 / L2 / L3        8 x 32 KB / 8 x 256 KB      4 x 32 KB / 4 x 256 KB      8 x 64 KB / 8 x 512 KB
                          / 20 MB                     / 8 MB                      / 10 MB
Cores/Node                16                          8                           16
Total Cores               19,712                      22,528                      142,304
Clock Speed (GHz)         2.60                        2.93                        2.40
Instruction Set           Intel® AVX                  SSE4.2                      SSE4a
Architecture (ISA)
Memory                    DDR3 1600 MHz               DDR3 1333 MHz               DDR3 1333 MHz
Memory/Core (GB)          2                           1.5                         2
Channels/Socket           4                           3                           4
Peak Node GFLOPS          332.8                       94.76                       153.6
INTERCONNECT
Manufacturer              Intel (QLogic)              Mellanox*                   Gemini*
Technology/Rate           InfiniBand* QDR             InfiniBand QDR              Custom
IB HW Interface           PSM                         Verbs                       Custom
Topology                  Fat Tree                    3D Torus: 6 x 6 x 8         3D Torus: 18 x 12 x 24

Table 1. Sandia National Laboratories Test Systems.
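The peak-GFLOPS row in Table 1 can be reproduced from cores, clock, and SIMD width. In this sketch, the 8- and 4-FLOPs-per-cycle figures are standard per-core widths for AVX and SSE, assumed here rather than taken from the report:

```python
# Peak node GFLOPS = cores/node x clock (GHz) x double-precision FLOPs/cycle.
# FLOPs/cycle: 8 with AVX (Sandy Bridge), 4 with SSE (Nehalem, Magny-Cours).
def peak_node_gflops(cores, ghz, flops_per_cycle):
    return round(cores * ghz * flops_per_cycle, 2)

print(peak_node_gflops(16, 2.60, 8))  # Chama: 332.8, matching Table 1
print(peak_node_gflops(16, 2.40, 4))  # Cielo: 153.6, matching Table 1
print(peak_node_gflops(8, 2.93, 4))   # Red Sky: 93.76 (Table 1 lists 94.76)
```

The small gap on Red Sky (93.76 computed vs. 94.76 listed) suggests either a rounding difference or a typo in the extracted table; we leave the listed value as-is.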
Key Findings
Sandia scientists tested the systems across a range of characteristics beyond those impacted by the interconnect, including memory performance and contention, processor performance, and more. Chama proved to be a well-balanced system with impressive performance results that outperformed Red Sky and compared well against Cielo. This paper, however, focuses on the interconnect benchmarks and application testing to understand how the interconnect contributes to overall HPC performance. The tests revealed the following about the Intel True Scale Fabric interconnect:
• Chama returned unprecedented results in MPI messaging rate at message sizes up to 1 KB, outperforming even Cielo’s custom interconnect.
• Chama delivered random messaging bandwidth the scientists had not yet seen from a commodity interconnect, exceeding Cielo by as much as 30 percent.
• Collectives performance scaling for Chama compares well against the custom interconnect of Cielo, both outperforming Red Sky by an order of magnitude.
• Chama scaled well against Cielo on three Sandia finite element production applications, which revealed severe scaling limitations on Red Sky.
The key findings from these micro-benchmarks and application tests indicate that Chama, with its Intel True Scale Fabric, “has a strong impact on applications,” as attested by Chama users.
Messaging Micro-benchmarks
While standard metrics include inter-node latency and bandwidth, Sandia scientists were keenly interested in Chama’s MPI messaging rate and scalable random message bandwidth. Figure 1 shows the benchmark results for these tests.
Bandwidth and Latency

Sandia codes are more sensitive to bandwidth than to latency; this sensitivity drove the choice of Chama’s Intel True Scale Fabric interconnect. As shown in Figures 1a and 1b, Chama performed well compared to Cielo’s custom Gemini* interconnect, according to Sandia scientists. We note that at sizes well within the typical HPC message-size space, Red Sky’s bandwidth climbed much more slowly, remaining about half of Chama’s, and its latency began to increase dramatically at just 64-byte messages.
Figure 1. Sandia Inter-node MPI Performance: A. Bandwidth; B. Latency; C. Message Rate; D. Random Messaging Bandwidth.
MPI Message Rate

Of particular interest to the testers at Sandia was the ability of the interconnect to process messages as core counts increased. HCA congestion on multi-core nodes is “becoming a significant constraint” in HPC with commodity interconnects, even those based on InfiniBand Architecture. “Therefore, the most important internode behavior for Chama is the significant gain in MPI message rate in comparison to Cielo.”

For message sizes up to 1 KB, the Intel True Scale Fabric outperformed the custom interconnect of Cielo by 2x to 4x. For Sandia, this was an unprecedented event, which “…can have a significant positive impact on many applications, such as those that employ a sparse solver kernel.”
Random Message Bandwidth

Not all inter-node communications are structured. Indeed, many applications, such as Charon, induce unstructured communications across the fabric. Measuring random message traffic therefore helps predict system performance with such codes. Sandia uses a random messaging benchmark for understanding scalability in commodity clusters. The test “sends thousands of small messages from all MPI tasks with varying message sizes (100 bytes to 1 KB) to random MPI rank destinations.” An aggregate average random messaging bandwidth (Figure 1d) was derived from per-process measurements.
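As an illustration only, not Sandia's actual benchmark code, the traffic pattern described above (small messages of 100 bytes to 1 KB sent to uniformly random ranks, never to oneself) can be sketched as:

```python
import random

# Sketch of the random-messaging traffic pattern: each MPI task sends
# many small messages (100 bytes to 1 KB) to random destination ranks.
# Only the destination/size schedule is modeled here, not MPI transfers.
def random_message_schedule(rank, n_ranks, n_messages, seed=0):
    rng = random.Random(seed + rank)
    schedule = []
    for _ in range(n_messages):
        dest = rng.randrange(n_ranks - 1)  # pick any rank except ourselves
        if dest >= rank:
            dest += 1
        size = rng.randrange(100, 1025)    # 100 bytes to 1 KB inclusive
        schedule.append((dest, size))
    return schedule

sched = random_message_schedule(rank=3, n_ranks=32, n_messages=1000)
assert all(dest != 3 for dest, _ in sched)
assert all(100 <= size <= 1024 for _, size in sched)
```

In a real MPI benchmark, each rank would post these sends and the aggregate average bandwidth would be derived from per-process timings, as the report describes.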
The measurements showed the following results, which the scientists had never seen with commodity interconnects benchmarked against a custom architecture:
• Red Sky, compared to Chama, performed from 10x slower (32 cores) to 220x slower (8,192 cores)
• Chama was 20 to 30 percent faster than Cielo, the capability supercomputer
Chama’s Intel True Scale Fabric scales extremely well with applications that create random traffic on large systems.
Global Communications – MPI Allreduce

To understand the behavior of Chama with applications that are sensitive to collective operations, Sandia averaged scalability performance data from a thousand trials using 8-, 64-, and 1,024-byte transfers. As shown in Figure 2, Chama performs competitively with Cielo across all ranks. Both perform an order of magnitude better than Red Sky in some cases, with Red Sky’s performance falling off above 1 KB messages.
Application Testing
For Red Sky, the benchmarks above echo the findings of previous studies Sandia performed on commodity clusters: poor scalability with applications that use implicit solvers, and poor parallel efficiency with higher amounts of unstructured message traffic (characteristics Chama did not exhibit in the micro-benchmarks).

These results, along with other discoveries on previous commodity clusters, had provided a “strong case” for Sandia to invest in more custom MPP machines. However, users of Chama have reported performance improvements of 2x to 5x with their application codes. To further understand these experiences, scientists proceeded with application testing.
Figure 2. IMB MPI_Allreduce Performance: A. 8 bytes; B. 64 bytes; C. 1024 bytes (average time in microseconds vs. MPI ranks; lower is better).
Table 2 lists four of the five applications used, along with their results, to help reveal how Chama compared to Red Sky and Cielo at scale. These results are consistent with users’ experiences.
Figure 3 graphs the results for the Finite Element Method tests Aleph, Aria, and Charon; Figure 4 shows the performance for AMG2006. Again, Red Sky exhibits severe scaling limitations, while Chama outperforms Cielo on all tests.
APPLICATION  SCIENCE DOMAIN          KEY ALGORITHM                  TIMING METRIC           MAX. IMPROVEMENT AT SCALE
                                                                                            CHAMA:RED SKY  CHAMA:CIELO
Aleph        Plasma simulation       Finite element method (FEM)    Weak scaling, fixed     4.2x           1.3x
                                     particle move + field solves   number of steps
AMG2006      Algebraic multigrid     Laplace solver, preconditioned Weak scaling,           1.5x           1.75x
                                     conjugate gradient             100 iterations
Aria         CFD, thermodynamics     Implicit FEM                   Strong scaling,         3.4x           2.6x
                                                                    25 time steps
Charon       Semiconductor device    Implicit FEM                   Weak scaling, fixed     2.5x           1.6x
             simulation                                             number of iterations

Table 2. Sandia Application Scaling Tests.
Figure 3. Charon, Aleph, and Aria Application Scaling: A. Charon; B. Aleph; C. Sierra/Aria (time vs. MPI ranks; lower is better).

Figure 4. AMG2006 Scaling Comparisons (PCG solve time in seconds vs. MPI ranks; lower is better).
Cielo Acceptance Benchmarks

A number of other applications were benchmarked on Chama but are not covered by the current Sandia report. However, results of four of the six Tri-Lab Cielo acceptance benchmarks were included; they are shown in Figure 5. Sandia scientists considered these results good, if “not as spectacular” as the earlier tests.
Sandia’s Conclusions
Sandia scientists stated the results for Chama’s Intel True Scale Fabric performance to be “unprecedented” and “never before seen” for a commodity interconnect. With its onload processing and its PSM interface, Chama’s Intel True Scale Fabric outperformed Red Sky’s Verbs-based InfiniBand communications and was competitive with the capability supercomputer Cielo. MPI profiles revealed that the faster MPI processing of Chama’s Intel True Scale Fabric contributed to its scalability and to the 2x to 5x performance improvement experienced by Chama’s users.
Figure 5. Cielo Acceptance Test Performance Summary: performance gain factors for Chama over Red Sky and for Chama over Cielo on SAGE, AMG2006, UMT, HPCCG, and CTH at 16, 128, and 1,024 MPI tasks.
“The performance gains that Sandia users are experiencing with their applications on…Chama, has resulted in many positive feedbacks from happy users. …we are seeing unprecedented performance and scalability of many key Sandia applications.”
CATEGORY     PARAMETER             TYPHOON                          LUNA
CPU core     Make/model            AMD Magny-Cours* Opteron* 6128   Intel® Xeon® E5-2670 (Sandy Bridge)
             Clock speed           2.0 GHz                          2.6 GHz
             L1 data cache size    64 KB                            32 KB
             L2 cache size         0.5 MB                           0.25 MB
CPU socket   Cores                 8                                8
             Shared L3 cache size  12 MB                            20 MB
             Memory controllers    4 x DDR3-1333                    4 x DDR3-1600
Node         Sockets               4                                2
             Memory capacity       64 GB                            32 GB
Network      Make                  Mellanox*                        Intel® True Scale Fabric
             Type                  QDR InfiniBand (Verbs)           QDR InfiniBand (PSM)
System       Integrator            Appro                            Appro
             Compute nodes         416                              1,540
             I/O nodes             12                               60
             Installation date     March 2011                       April 2012
“CHANGING THE WAY WE WORK” – LUNA AT LOS ALAMOS NATIONAL LABORATORY

Los Alamos National Laboratory has a nearly 70-year history of discovery and innovation in science and technology. Its mission is to “develop and apply science and technology to ensure the safety, security, and reliability of the U.S. nuclear deterrent; to reduce global threats; and solve other emerging national security and energy challenges.”4
In 2012, it acquired Luna as part of TLCC2, and “reports from users have been extremely positive.” In particular, two directed stockpile work (DSW) problems completed by users Mercer-Smith and Scott ran 3.9x and 4.7x faster on Luna than on other systems. Scientists at Los Alamos were asked to understand why Luna performed so much better. Their research is captured in benchmarks and application testing between Luna and Typhoon.5
Table 3 lists the configurations of the two systems used in the evaluation.
Key Findings
Los Alamos scientists performed application tests to compare performance and scalability, plus micro-benchmarks to help understand what makes the systems perform differently. As at Sandia, the tests were comprehensive across a variety of characteristics; however, this paper focuses on the results of interconnect micro-benchmarks and application testing.
We note that the authors discovered Typhoon exhibited atypical InfiniBand bandwidth performance during the single-node communication micro-benchmark. This led to a later evaluation of Typhoon’s InfiniBand performance and an ensuing report.6 The findings revealed that a configuration problem caused lower than expected InfiniBand performance on Typhoon. When it was corrected and the application xRAGE used in the current tests was rerun, Typhoon improved by about 21 percent on xRAGE. Whether this handicap carried across to all Typhoon tests is unclear. Thus, in this paper, where appropriate, we award Typhoon a 21 percent benefit and present the resulting values in parentheses next to the original report’s results. Nonetheless, Luna generally outperformed Typhoon on every test and micro-benchmark Los Alamos performed, with some variability.
The Los Alamos tests revealed the following:
• Across several comparisons, Luna rates from 1.2x to 4.7x faster than Typhoon.
• Luna’s interconnect supports nearly full InfiniBand QDR bandwidth with little to no contention scaling to 16 cores, while Typhoon starts out fast and degrades steadily to 32 cores without achieving nearly full InfiniBand speeds.
• At 16 cores, Luna’s Intel True Scale Fabric is 2.10x (1.74x) faster than Typhoon; at 32 cores, the difference rises to 2.19x (1.81x) faster.
• Collectives performance showed Luna with an average of 1.95x (1.61x) improvement over Typhoon, but with variability.

The key findings from these micro-benchmarks and application tests indicate that Luna, with its Intel True Scale Fabric, delivers a wide range of performance improvements over Typhoon.
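The parenthetical values in these findings follow from the 21 percent correction described earlier; our reading is that each measured Luna:Typhoon ratio is divided by 1.21:

```python
# Applying the 21 percent Typhoon correction: divide each measured
# Luna:Typhoon ratio by 1.21 (our interpretation of the adjustment).
def corrected(ratio, typhoon_gain=0.21):
    return round(ratio / (1 + typhoon_gain), 2)

print(corrected(2.10))  # 1.74, as quoted at 16 cores
print(corrected(2.19))  # 1.81, as quoted at 32 cores
print(corrected(1.95))  # 1.61, the collectives average
```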
Table 3. Los Alamos National Laboratory Test Systems.
“Luna is the best machine that the laboratory has ever had.”
APPLICATION/SOURCE                   LUNA:TYPHOON   DESCRIPTION
Theoretical peak memory BW           1.2x           The simple ratio of Luna’s memory bandwidth to Typhoon’s
xRAGE                                1.56x (1.29x)  A collectives-heavy code
EAP test suite                       1.69x (1.40x)  A collection of 332 regression tests from the Eulerian
                                                    Applications Project (EAP), run nightly on Luna and Typhoon
Mizzen problem                       2.07x (1.71x)  An integrated code representative of the types of codes
                                                    normally run on Luna and Typhoon
Theoretical compute rate             2.6x           Calculated maximum theoretical FLOPS
High-Performance Linpack* benchmark  2.72x          According to the June 2012 Top500 list
PARTISN, sn timing                   2.75x (2.28x)  A more communications-active code than xRAGE, with many
                                                    small message exchanges
ASC1 code (Mercer-Smith & Scott)     3.9x           DSW problem; not part of the current testing
ASC2 code (Mercer-Smith & Scott)     4.7x           DSW problem; not part of the current testing
Application Testing
Los Alamos scientists performed four application tests, varying the number of cores and nodes for different tests. They sought to thoroughly understand what drives Luna’s significant improvements and attempted to repeat the improvements Mercer-Smith and Scott experienced. The tests and the sources of other metrics are briefly described in Table 4, along with the results.

The full extent of their comprehensive testing is beyond the scope of this paper; the results are summarized in Figure 6.

Using theoretical calculations, actual measurements, and the experiences reported by users, Luna averages about 2.5x faster than Typhoon.
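The two “theoretical” rows in Table 4 and the roughly 2.5x average can be reproduced with simple arithmetic. In this sketch, the DDR3 speeds and clock rates come from Table 3, while the 8-versus-4 FLOPs-per-cycle SIMD widths for AVX and SSE are our assumption, not figures from the report:

```python
# Deriving the "theoretical" ratios in Table 4 and the overall average.
mem_bw_ratio = 1600 / 1333             # 4 channels of DDR3-1600 vs. DDR3-1333
compute_ratio = (2.6 * 8) / (2.0 * 4)  # per-core GFLOPS: Luna vs. Typhoon

ratios = [1.2, 1.56, 1.69, 2.07, 2.6, 2.72, 2.75, 3.9, 4.7]  # Table 4 column
average = sum(ratios) / len(ratios)

print(round(mem_bw_ratio, 2))   # 1.2, the "theoretical peak memory BW" row
print(round(compute_ratio, 1))  # 2.6, the "theoretical compute rate" row
print(round(average, 2))        # 2.58, consistent with "about 2.5x faster"
```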
Communications Micro-benchmarks
As with Sandia, Los Alamos scientists ran several micro-benchmarks to isolate some of the causes of Luna’s performance edge over Typhoon. The Los Alamos tests also isolated several improvements at the node and processor architectural levels. But, again, this paper focuses on the interconnect’s contribution to overall performance.
Table 4. Application Test Descriptions.
Figure 6. Luna:Typhoon Applications Performance Summary (performance ratios at 128 MPI ranks): theoretical peak memory bandwidth, 1.2; xRAGE asteroid problem (Pakin & Lang), 1.56; EAP test suite geometric mean (EAP team), 1.69; integrated code, mizzen problem (Brown), 2.07; theoretical peak compute rate, 2.6; High-Performance Linpack benchmark, 2.72; PARTISN sn timing (Pakin & Lang), 2.76; ASC1 code/problem (Mercer-Smith & Scott), 3.9; ASC2 code/problem (Mercer-Smith & Scott), 4.7.
Node-to-Node Bandwidth and Adapter Contention

This micro-benchmark exchanges a large volume of data between two nodes, starting with a single core on each node and scaling to all 16 cores on a node (for Luna) or 32 cores (for Typhoon). The test records the bandwidth consumed for each exchange. Figure 7 charts the results.

For Luna, the first exchanges do not saturate the network, but within four cores, full speed is achieved at 3,151 B/μs and held across all 16 cores, with little measurable degradation from contention. (The same behavior was seen at Sandia, where the messaging rate scaled well across many MPI ranks.) Typhoon’s network, while starting out faster than Luna at 1,879 B/μs, degraded steadily to 1,433 B/μs as core count increased, indicative of contention as the adapter tries to pass traffic from more cores.

The scientists determined, “while Luna’s per-core (aggregate divided by number of cores) communication bandwidth is 2.10x [(1.74x)]7 that of Typhoon’s at 16 cores/node, this ratio increases to 2.19x [(1.81x)] when comparing a full Luna node to a full Typhoon node.”

Global Communications – MPI_Allreduce Results

For collectives performance, the Los Alamos authors created a micro-benchmark that reports the average time per MPI_Allreduce operation for various message sizes across 128 MPI ranks. Figure 8 graphs the ratio of Typhoon’s performance to Luna’s. The authors note “…the geometric mean of the measurements indicate that Typhoon takes an average (horizontal line) of 1.95x (1.61x) as long as Luna to perform an MPI_Allreduce…” They are also drawn to the variability of the results, concluding, as with other results in their study, that “there is a large set of corner cases where Luna can be many times faster than Typhoon—and some applications may in fact hit these cases—but more modest speedups are the more common case.”

Los Alamos’ Conclusions

Luna outperforms Typhoon by 1.2x to 4.7x, as indicated by both theoretical and actual results. The authors conclude that “…almost all key components of Luna—CPUs, memory and network—are faster than their Typhoon counterparts, but by widely varying amounts (and in nonlinear patterns) based on how these hardware resources are utilized.”

Indeed, one set of users considers Luna the best machine the Laboratory has ever had. Other user experiences are so positive that Luna is changing how some work is planned going forward.
Figure 7. Network Bandwidth as a Function of Contention for the NIC (aggregate communication bandwidth in B/μs vs. number of communicating pairs of processes; series: Typhoon, Luna, and theoretical peak).

Figure 8. Ratio of Luna’s MPI_Allreduce Latency to Typhoon’s for 128 MPI Ranks (Typhoon:Luna latency vs. message size).
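The quoted full-node ratio can be checked against the Figure 7 endpoints, and the figure's theoretical-peak line against general QDR InfiniBand math (the 8b/10b encoding arithmetic is standard QDR background, not a figure from the report):

```python
# Full Luna node vs. full Typhoon node, from the Figure 7 endpoints.
luna_full_node = 3151     # B/us, held across all 16 Luna cores
typhoon_full_node = 1433  # B/us, after degrading across 32 Typhoon cores
print(round(luna_full_node / typhoon_full_node, 2))  # 2.2, i.e. the ~2.19x quoted

# QDR InfiniBand: 4 lanes x 10 Gb/s with 8b/10b encoding -> 32 Gb/s of data,
# which is 4 GB/s, or 4,000 B/us: the theoretical-peak line in Figure 7.
qdr_peak_bytes_per_us = 4 * 10e9 * 8 / 10 / 8 / 1e6
print(qdr_peak_bytes_per_us)  # 4000.0
```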
“Luna tends to be about twice as fast as Typhoon across the various micro-benchmarks, but there are many outliers.”
SUPREME SCALABILITY – ZIN AT LAWRENCE LIVERMORE NATIONAL LABORATORY

Beginning operations in 1952, Lawrence Livermore National Laboratory has grown into a diverse complex of science, research, and technology, part of which supports the ASC Program and missions of the NNSA. The Terascale Simulation Facility (TSF) at Lawrence Livermore National Laboratory houses TLCC2 clusters and includes the world’s second fastest supercomputer, Sequoia, according to the Top500 list.1

In 2011, the Laboratory acquired Zin, the latest addition to its TLCC2. The Zin cluster comprises 2,916 nodes, 46,208 cores, and an Intel True Scale Fabric network. Soon after its delivery in 2011, Zin was ranked number 15 on the Top500 list of the fastest supercomputers in the world. A year later, it remains among the 30 fastest systems.
In 2012, Lawrence Livermore scientists ran scalability benchmarks across Zin and several other systems in the Tri-Labs complex, including other TLCC units and capability machines, such as Cielo at Sandia National Laboratories. The results were presented at SC12 in November. Figure 9 graphs the results.
Of the six systems in the comparison, Cielo, Purple, and Dawn are capability MPP machines, while Sierra, Muir, and Zin are capacity clusters—all three using Intel True Scale Fabric networks.
In this graph, the lower and flatter the scalability line, the better; a slope of 0 indicates ideal scalability. The three most scalable systems (Sierra, Muir, and Zin) were interconnected with Intel True Scale Fabric components. Zin outperforms the other two systems built on custom interconnects. We note that Cielo is the capability supercomputer at Sandia against which Chama competed so well.
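The slopes reported in Figure 9 can be made concrete by multiplying each slope by a processor count, giving the extra grind time accumulated at that scale (the extrapolation is ours; the slope values are from the figure):

```python
# Extra microseconds per zone-iteration accumulated when scaling from
# 0 to 10,000 processors, at the slopes reported in Figure 9.
slopes = {
    "Purple": 0.000079,
    "Dawn (BG/P)": 0.000016,
    "Zin": 0.000012,
    "Cielo": 0.000010,
    "Sierra": 0.000008,
    "Muir": 0.000005,
}
procs = 10_000
for machine, slope in sorted(slopes.items(), key=lambda kv: kv[1]):
    print(f"{machine}: +{slope * procs:.2f} us/zone-iteration at {procs:,} processors")
```

By this measure, Purple's grind time grows more than six times faster with scale than Zin's.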
SUMMARY And ConClUSIonSAcross three NNSA national laboratories, TLCC and TLCC2 capacity computing systems powered by Intel True Scale Fabric networks and Intel Xeon processor E5-2600 outperform other machines, including MPP capability supercomput-ers. At Sandia National Laboratories, Chama delivers “unprecedented scalability and performance.” Users of Luna at Los Alamos National Laboratory claim it is the “best machine the Laboratory has ever had,” and it is “changing the way we work.”
Zin at Lawrence Livermore National Labo-ratory, along with two other TLCC clusters built with Intel True Scale Fabric, dominate the scalability testing results of the most scalable systems in the benchmark.
These tests reveal how Intel True Scale Fabric with PSM and onload processing outperforms other interconnects used in HPC and drives some of the fastest supercomputers in the world.
Figure 9. Scaling Results of Zin and Other Tri-Labs Machines.
Weak Scaling – 3D Radiation Problem’s Average Zone-Iteration Grind Time per Machine.
Y-axis: microseconds per zone-iteration; X-axis: processors (CPUs). Lower and flatter is better; a slope of 0 is ideal scaling.
Slopes: Purple – 0.000079; Dawn (BG/P) – 0.000016; Zin – 0.000012; Cielo – 0.000010; Sierra – 0.000008; Muir – 0.000005.
Systems with Intel True Scale Fabric: Muir (full QDR), Sierra (full QDR), and Zin (full, 16 MPI/node). Also shown: Cielo (PGI full, 16 MPI/node), Dawn 2.2, and Purple at retirement (NewComm).
Source: Lawrence Livermore National Laboratory. Intel does not control or audit the design or implementation of third-party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase. See the Top500 list for configuration information at http://top500.org.
Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting Intel’s Web site at www.intel.com.
1 www.top500.org/list/2012/11
2 www.sandia.gov/about/index.html
3 Rajan, M., D.W. Doerfler, P.T. Lin, S.D. Hammond, R.F. Barrett, and C.T. Vaughan. “Unprecedented Scalability and Performance of the New NNSA Tri-Lab Linux Capacity Cluster 2,” Sandia National Laboratories.
4 www.lanl.gov/mission/index.php
5 Pakin, Scott and Michael Lang. “Performance Comparison of Luna and Typhoon,” Los Alamos National Laboratory High-Performance Computing Division, November 19, 2012.
6 Coulter, Susan and Daryl W. Grunau. “Typhoon IB Performance,” Los Alamos National Laboratory, March 8, 2013.
7 Bracketed values are added by Intel to offset the report results as described earlier.
Copyright © 2013 Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Printed in USA 0513/ML/HBD/PDF Please Recycle 328985-001US