HPC Overview


Overview of High Performance Computing (HPC) systems


  • Top High Performance Computing Systems

    A Short Overview

  • CRAY - JAGUAR

    Jaguar was a petascale supercomputer built by Cray at Oak Ridge National Laboratory with a peak performance of 1.75 PFlops. In 2012 the XT5 Jaguar was replaced by Titan, a Cray XK7 system that is currently ranked as the world's fastest computer. The key features of Jaguar are described first, followed by the differences introduced by its upgrade, Titan.

    System Specifications (most important):

    Compute Node:

    2 AMD Opteron 2435 hex-core processors

    Cray-specific SeaStar network:
    Improves scalability
    Low latency, high bandwidth interconnect
    Upgradable processor, memory and interconnect

    Topology:
    3D torus network (an addressing sketch follows the specification table below)
    192 I/O nodes distributed across the torus to prevent hot spots, connected to the storage via InfiniBand
    Fabric connections provide redundant paths

    Processor: 224,256 AMD Opteron cores @ 2.6 GHz
    Memory: 300 TB
    Cabinets: 200 (8 rows of 25 cabinets)
    Compute Nodes: 18,688
    Performance: peak performance 1.75 PFlops
    Size: 4,600 square feet
    Power: 480-volt power per cabinet
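
    As a rough illustration of the 3D torus addressing mentioned above (a minimal sketch with hypothetical dimensions, not Jaguar's actual cabling or routing), the Python fragment below lists the six neighbors of a node; coordinates wrap around in every dimension, so there are no edge nodes:

        # Minimal sketch: neighbor addressing in a 3D torus.
        # DIMS is hypothetical and chosen only for illustration.
        DIMS = (16, 16, 16)

        def torus_neighbors(x, y, z, dims=DIMS):
            nx, ny, nz = dims
            return [
                ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
                (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
                (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
            ]

        print(torus_neighbors(0, 0, 0))   # wrap-around links reach the far faces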

  • Figure 1: XT5 Topology

    Packaging Hierarchy:

    Blade: 4 network connections

    Cabinet (Rack): 192 Opteron processors, 776 Opteron cores, 96 nodes

    System: 200 cabinets

    Figure 2: Building the Cray XT5 System

  • Titan - Main Differences:

    Jaguar was fitted with 18,688 of the first shipping NVIDIA Tesla K20 GPUs and renamed Titan

    Compute node: Consists of a 16-core AMD Opteron 6274 processor and a K20 GPU accelerator

    Memory: 710 TB

    Peak Performance: more than 20 PFlops

    Titan Interconnect:

    1 Gemini routing and communications ASIC per two nodes

    48 switch ports per Gemini chip (160 GB/s internal switching capacity per chip)

    3D torus interconnect

    Gemini Interconnect:

    Enables tens of millions of MPI messages per second (a minimal MPI example follows Figure 3)

    Each hybrid node is interfaced to the Gemini interconnect through HyperTransport™ 3.0 technology
    This architecture bypasses PCI bottlenecks and provides a peak of 20 GB/s of injection bandwidth per node
    Connectionless protocol
    Highly scalable architecture

    Powerful bisection and global bandwidth

    Dynamic routing of messages

    Figure 3: Gemini Interconnect
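
    For reference, the MPI messages whose rate Gemini is designed to sustain look like the generic mpi4py ping below; nothing here is Gemini-specific, and the script name is only an example:

        # Generic MPI point-to-point ping (mpi4py); run e.g. with: mpiexec -n 2 python ping.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        if rank == 0:
            comm.send(b"ping", dest=1, tag=0)     # small message out
            reply = comm.recv(source=1, tag=1)    # wait for the echo
            print("rank 0 received", reply)
        elif rank == 1:
            msg = comm.recv(source=0, tag=0)
            comm.send(b"pong", dest=0, tag=1)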

  • IBM - Blue Gene/Q

    Blue Gene/Q is the third generation of IBM's series of HPC systems, Blue Gene, following the Blue Gene/L and /P architectures. It is a massively parallel computer designed to solve large-scale scientific problems such as protein folding.

    Key Features:

    Small footprint

    High power efficiency (world's most efficient supercomputer, Green500, June 2011)
    Low latency, high bandwidth inter-processor communication system
    1,024 nodes/rack
    Scales to 512 racks (currently implemented with 256); peak performance 100 PFlops
    Integrated 5D torus with optical cables for inter-processor communication
    Tremendous bisection bandwidth

    System Specifications (most important):

    Processor: IBM PowerPC A2, 1.6 GHz, 16 cores per node
    Memory: 16 GB SDRAM-DDR3 per node (1333 MT/s)
    Networks: 5D torus @ 40 GB/s, 2.5 μs latency; collective network part of the 5D torus (collective logic operations supported); global barrier/interrupt part of the 5D torus; PCIe x8 Gen2 based I/O; 1 GB control network for system boot, debug and monitoring
    I/O Nodes: 16-way SMP processor (10 GbE or InfiniBand); configurable as 8, 16 or 32 I/O nodes per rack
    Performance: peak performance per rack 209.7 TFlops
    Power: typical 80 kW per rack (estimated); 380-415 or 480 VAC three-phase; maximum 100 kW per rack; 4 x 60 A service per rack
    Dimensions: height 2095 mm, width 1219 mm, depth 1321 mm; weight 4500 lbs with coolant (LLNL 1 I/O drawer configuration); service clearances 914 mm on all sides
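
    A quick consistency check on the rack-level peak above, assuming 8 flops per cycle per A2 core (a 4-wide FMA unit; this assumption is not stated in the table itself):

        # Back-of-the-envelope check of the Blue Gene/Q figures above.
        cores_per_node  = 16
        clock_ghz       = 1.6
        flops_per_cycle = 8        # assumed: 4-wide FMA per core
        nodes_per_rack  = 1024

        node_peak_gflops = cores_per_node * clock_ghz * flops_per_cycle  # 204.8 GFlops
        rack_peak_tflops = node_peak_gflops * nodes_per_rack / 1000      # ~209.7 TFlops
        print(node_peak_gflops, rack_peak_tflops)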

  • Compute Chip:

    System-on-a-Chip Design: integrates processors, memory and networking logic into a single chip

    - 360 mm² Cu-45 technology (SOI)
    - 16 user + 1 service processors
    - plus 1 redundant processor
    - all processors are symmetric
    - 1.6 GHz
    - L1 I/D cache: 16 KB / 16 KB
    - L1 prefetch engines
    - peak performance 204.8 GFlops @ 55 W

    - Central shared L2 cache: 32 MB

    - Dual memory controller

    - Chip-to-chip networking

    - Router logic integrated into BQC chip

    - External IO

    Crossbar Switch:

    Central connection structure between processors, L2 cache, networking logic and various low-

    bandwidth units

    Frequency: 800 MHz (half-frequency clock grid)

    3 separate switches:

    Request traffic -- write bandwidth 12B/PUnit @ 800 MHz

    Response traffic -- write bandwidth 12B/PUnit @ 800 MHz

    Invalidate traffic

    22 master ports

    18 slave ports

    Peak on-chip bisection bandwidth 563 GB/s

    Networking Logic:

    Communication ports:

    11 bidirectional chip-to-chip links @ 2GB/s

    2 links can be used for PCIe Gen2 x8

    On-chip networking logic

    Implements 14-port router

    Designed to support point-to-point, collective and barrier messages

  • Packaging Hierarchy:

    Basic Topology - Interconnect:

    Node Board:
    Connects 32 compute cards into a 5-D torus of length 2. Each card consists of one compute integrated circuit and 72 memory chips
    Signaling rate @ 4 Gbps
    Boards on the same midplane are connected via the midplane connector
    Boards on different midplanes are connected via 8 link chips on each midplane
    These link chips drive optical transceivers (@ 10 Gbps) through optical cables

    Midplane:
    Midplane interconnection is a 512-node, 5-D torus of size 4 x 4 x 4 x 4 x 2

    Rack:
    Contains one or two fully populated midplanes
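
    The torus sizes above multiply out to the node counts quoted earlier (a simple check, no new data):

        # 5-D torus of size 4 x 4 x 4 x 4 x 2 per midplane; two midplanes per rack.
        from math import prod

        nodes_per_midplane = prod((4, 4, 4, 4, 2))   # 512
        nodes_per_rack     = 2 * nodes_per_midplane  # 1,024 nodes/rack, as listed above
        print(nodes_per_midplane, nodes_per_rack)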

  • Figure 4: Compute torus dimensionality of Blue Gene/Q building blocks

    CRAY - XC30 Series

    The Cray XC30 supercomputer is Cray's next-generation supercomputing series, upgradable to 100 PFlops per system. Known as the "Cascade" program, it was made possible in part by Cray's participation in the Defense Advanced Research Projects Agency's (DARPA) High Productivity Computing Systems program. One of the key features of this system is the introduction of the Aries™ interconnect, an interconnect chipset follow-on to Gemini, with a new system interconnect topology called "Dragonfly". This topology provides scalable system size and global network bandwidth.

    System Specifications (most important):

    Processor: 64-bit Intel Xeon E5-2600 Series processors; up to 384 per cabinet
    Memory: 32-128 GB per node; memory bandwidth up to 117 GB/s per node
    Compute Cabinet: initially up to 3,072 processor cores per system cabinet, upgradeable
    Interconnect: 1 Aries routing and communications ASIC per four compute nodes; 48 switch ports per Aries chip (500 GB/s switching capacity per chip); Dragonfly interconnect with low latency, high bandwidth topology
    Performance: peak performance initially up to 66 TFlops per system cabinet
    Power: 88 kW per compute cabinet, maximum configuration; circuit requirements (2 per compute cabinet): 100 A at 480/277 VAC or 125 A at 400/230 VAC (three-phase, neutral and ground); 6 kW per blower cabinet, 20 A at 480 VAC or 16 A at 400 VAC (three-phase, ground)

  • Compute Node - Components:

    A pair of Intel Xeon E5 processors (16 or more cores)

    8 DDR-3 memory channels

    Memory capacity: 64 or 128 GB

    Aries Device - Interconnects:

    System-on-chip device comprising four Network Interface Controllers, a 48-port tiled router and

    a multiplexer

    It provides connectivity for four nodes on a blade

    Aries devices are connected to each other via cables and backplanes

    Figure 5: Aries Device Connecting 4 Nodes

    Dimensions: H 80.25 in. x W 35.56 in. x D 62.00 in. (compute cabinet); H 80.25 in. x W 18.00 in. x D 42.00 in. (blower cabinet); 3,450 lbs per compute cabinet (liquid cooled), 243 lbs/sq ft floor loading; 750 lbs per blower cabinet
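
    The cabinet-level figures above tie together as follows; the 192 nodes per cabinet come from the Dragonfly packaging described in the next section (3 chassis x 16 blades x 4 nodes), and the rest is simple arithmetic:

        # Cross-checking the XC30 cabinet figures (node count taken from the Dragonfly section).
        nodes_per_cabinet = 192    # 3 chassis x 16 blades x 4 nodes
        sockets_per_node  = 2      # "a pair of Intel Xeon E5 processors"
        cores_per_socket  = 8      # initial E5-2600 parts
        nodes_per_aries   = 4      # 1 Aries ASIC per four compute nodes

        print(nodes_per_cabinet * sockets_per_node)                     # 384 processors/cabinet
        print(nodes_per_cabinet * sockets_per_node * cores_per_socket)  # 3,072 cores/cabinet
        print(nodes_per_cabinet // nodes_per_aries)                     # 48 Aries devices/cabinet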

  • Dragonfly Topology - Features and Benefits:

    Each chassis in the system consists of 16 four-node blades (64 nodes per chassis; see the group-size check after Figure 7)

    Each cabinet has three chassis (192 nodes)

    Dragonfly network is constructed from two-cabinet electrical groups with 384 nodes/group

    All connections within a group are electrical @14Gbps per lane

    This topology provides good global bandwidth and low latency at lower cost compared to a fat

    tree

    Figure 6: Chassis backplane provides all-to-all connectivity between blades. Chassis within the same group connect to each other via electrical connectors. Groups connect together via optical cables.

    Figure 7: 2-D all-to-all structure is used within a group
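
    The 384-node group size quoted above follows directly from the packaging (a check, with no new data):

        # Nodes per Dragonfly electrical group, from the packaging described above.
        nodes_per_blade     = 4
        blades_per_chassis  = 16
        chassis_per_cabinet = 3
        cabinets_per_group  = 2

        nodes_per_chassis = nodes_per_blade * blades_per_chassis        # 64
        nodes_per_group   = nodes_per_chassis * chassis_per_cabinet * cabinets_per_group
        print(nodes_per_chassis, nodes_per_group)                       # 64, 384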

  • FUJITSU - K COMPUTER

    The K computer is a massively parallel computer system developed by Fujitsu and RIKEN as part of the High Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology. It is designed to perform extremely complex scientific calculations and was ranked as the top-performing supercomputer in the world in 2011.

    System Specifications (most important):

    CPU (SPARC64 VIIIfx):
    Cores/Node: 8 cores @ 2 GHz
    Performance: 128 GFlops
    Architecture: SPARC V9 + HPC extensions
    Cache: L1 (I/D) 32 KB / 32 KB; L2 6 MB
    Power: 58 W (typical at 30 °C)
    Memory Bandwidth: 64 GB/s

    Node:
    Configuration: 1 CPU/node
    Memory Capacity: 16 GB (2 GB/core)

    System Board (SB): 4 nodes/SB
    Rack: 24 SBs/rack
    System: >80,000 nodes/system; peak performance >10 PFlops

    Interconnect:
    Topology: 6D mesh/torus
    Performance: 5 GB/s per link
    No. of Links: 10 links/node
    Architecture: routing chip structure (no external switch box)

    Compute Chip (CPU):

    Extended SPARC64™ VII architecture for HPC
    High performance and low power consumption
    Highly reliable design
    1,271 signal pins
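
    The 128 GFlops per CPU quoted in the table above is consistent with 8 flops per cycle per core; the per-core FMA width is an assumption here, not something the table states. A minimal check in Python:

        # Consistency check for the SPARC64 VIIIfx peak (flops/cycle is assumed).
        cores = 8
        clock_ghz = 2.0
        flops_per_cycle = 8    # assumed: SIMD FMA pipelines delivering 8 flops/cycle/core
        print(cores * clock_ghz * flops_per_cycle)   # 128.0 GFlops, matching the table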

  • Compute Node:

    Single CPU and an interconnect controller
    10 links for inter-node connection @ 10 GB/s per link

    Topology - Tofu Interconnect:

    Logical 3-D torus network

    Physical topology: 6-D Mesh/Torus

    10 links/node: 6 for the 3D torus (coordinates x, y, z) and 4 redundant links for the 3D mesh/torus (coordinates a, b, c); a link-count sketch follows the 6-D benefits list below

    Bi-section bandwidth: >30TB/s

    Figure 8: Topology

    6-D Benefits:

    High Scalability

    Twelve times higher than the 3D torus

    High Performance and Operability

    Low hop-count (average hop count about 1/2 of 3D torus)

    High Fault Tolerance

    12 possible alternate paths to bypass faulty nodes

    Redundant node can be assigned preserving the torus topology
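
    The 10 links per node follow from the dimension sizes if the redundant a, b, c dimensions are assumed to have sizes 2, 3 and 2 (an assumption; the text above only gives the link and dimension counts). A small Python sketch:

        # Counting inter-node links in a Tofu-style 6-D mesh/torus.
        # Assumption: x, y, z are large torus rings (2 links each);
        # a, b, c have sizes 2, 3, 2 (a size-2 ring needs only one link).
        def links_per_dimension(size):
            return 1 if size == 2 else 2

        xyz_links = 3 * 2                                            # 6 links
        abc_links = sum(links_per_dimension(s) for s in (2, 3, 2))   # 1 + 2 + 1 = 4
        print(xyz_links + abc_links)                                 # 10 links/node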

  • Packaging Hierarchy:

    Figure 9: K-Computer Packaging Hierarchy

    DELL - STAMPEDE

    Stampede was developed by Dell at the Texas Advanced Computing Center (TACC) as an NSF-funded project. It is a new HPC system and one of the largest computing systems in the world for open science research. It was ranked 7th on the Top500 list in 2012.

    System Specifications (most important):

    Processor: two Intel 8-core Xeon E5 processors per node @ 2.7 GHz; one Intel 61-core Xeon Phi coprocessor per node @ 1.1 GHz
    Memory: 32 GB per node (host memory), plus an additional 8 GB on the coprocessor card
    Packaging: 6,400 nodes in 160 racks (40 nodes/rack), along with two 36-port Mellanox leaf switches per rack
    Interconnect: nodes interconnected with Mellanox FDR InfiniBand in a 2-level (core and leaf) fat-tree topology @ 56 Gb/s
    Performance: peak performance 10 PFlops
    Power: >6 MW total power

  • Compute Node:

    Figure 10: Stampede Zeus Compute Node

    Interconnect:

    It consists of Mellanox switches, fiber cables and Host Channel Adapters, with approximately 75 miles of cables
    Eight core 648-port SX6536 switches and more than 320 36-port SX6025 endpoint switches form a 2-level Clos fat-tree topology
    They have switching capacities of roughly 73 and 4.0 Tb/s respectively
    5/4 oversubscription at the endpoint (leaf) switches; see the port-count sketch after Figure 11
    Key feature: any MPI message is at most 5 hops from source to destination

    Figure 11: Stampede Interconnect
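
    The 5/4 oversubscription above can be reproduced from the port counts; the 20/16 split of each leaf switch into host-facing and uplink ports is inferred from 40 nodes per rack and two leaf switches per rack, not stated explicitly. A short Python check:

        # Leaf-switch oversubscription on Stampede (port split inferred, see note above).
        from fractions import Fraction

        ports_per_leaf  = 36
        nodes_per_rack  = 40
        leaves_per_rack = 2

        down_ports = nodes_per_rack // leaves_per_rack   # 20 host links per leaf (inferred)
        up_ports   = ports_per_leaf - down_ports         # 16 uplinks toward the core switches
        print(Fraction(down_ports, up_ports))            # 5/4 oversubscription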

    NUDT - TIANHE-I

    Developed by the Chinese National University of Defense Technology, it was the fastest computer in the world from October 2010 to June 2011, and it is now one of the few petascale supercomputers in the world.

  • System Specifications (most important):

    Interconnect:

    Use of high-radix Network Routing Chips (NRC) and high-speed Network Interface Chips (NIC)
    The NRC and NIC chips are proprietary, in-house designs
    The topology is an optic-electronic hybrid, hierarchical fat-tree structure
    Bi-directional bandwidth @ 160 Gbps; latency 1.57 μs; switch density 61.44 Tbps in a single backboard

    Figure 12: Architecture of the Interconnect Network

    Processor: two 6-core Intel Xeon X5670 CPUs @ 2.93 GHz and one NVIDIA M2050 GPU @ 1.15 GHz per compute node; two 8-core FT-1000 CPUs @ 1.0 GHz per service node
    Memory: 262 TB total
    Nodes/Packaging: 7,168 compute nodes and 1,024 service nodes installed in 120 racks
    Interconnect: fat-tree structure with bi-directional bandwidth @ 160 Gbps
    Performance: peak performance 4,700 TFlops
    Power: 4.04 MW
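
    The 4,700 TFlops peak above is roughly reproduced by the per-node parts, assuming 4 double-precision flops per cycle per Xeon core and a 515 GFlops double-precision peak for the M2050; both figures are assumptions, not taken from the table. A rough Python check:

        # Rough reconstruction of the Tianhe-1A peak from its compute-node parts.
        # Assumptions: 4 DP flops/cycle per Westmere core, 515 GFlops DP per M2050 GPU.
        compute_nodes = 7168
        xeon_gflops   = 2 * 6 * 2.93 * 4    # two 6-core X5670s @ 2.93 GHz
        gpu_gflops    = 515                 # assumed M2050 double-precision peak
        node_gflops   = xeon_gflops + gpu_gflops
        print(round(compute_nodes * node_gflops / 1000))   # ~4700 TFlops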

  • The first layer consists of 480 switching boards. In each rack, every 16 nodes connect with each other through the switching board in the rack. The main boards are connected to the switching board via a backplane board. Communication on the switching boards uses electrical transmission.

    The second layer contains 11 384-port switches connected with QSFP optical fibers. Each 384-port switch contains 12 leaf switch boards and 12 root switch boards, connected orthogonally by a high-density backplane. The switching boards in the racks are connected to the 384-port switches through optical fibers.

    References:

    [1] Top 500 supercomputer sites: http://www.top500.org/

    [2] A. Bland, R. Kendall, D. Kothe, J. Rogers, G. Shipman. Jaguar: The World's Most Powerful Computer

    [3] B. Bland. The World's Most Powerful Computer, presentation at the Cray User Group 2009 Meeting

    [4] Jaguar (supercomputer): http://en.wikipedia.org/wiki/Jaguar_(supercomputer)

    [5] Titan (supercomputer): http://en.wikipedia.org/wiki/Titan_(supercomputer)

    [6] Cray XK7: http://www.cray.com/Products/Computing/XK7.aspx

    [7] Blue Gene/Q: http://en.wikipedia.org/wiki/Blue_Gene

    [8] IBM. IBM System Blue Gene/Q, IBM Systems and Technology Data Sheet

    [9] J. Milano, P. Lembke. IBM System Blue Gene Solution, Blue Gene/Q Hardware Overview and

    Installation Planning, Redbooks

    [10] M. Ohmacht/ IBM BlueGene Team. Memory Speculation of the Blue Gene/Q Compute Chip,

    presentation, 2011

    [11] R. Wisniewski/ IBM BlueGene Team. BlueGene/Q: Architecture CoDesign; Path to Exascale,

    presentation, 2012

    [12] Cray XC30: http://www.cray.com/Products/Computing/XC.aspx

    [13] B. Alverson, E. Froese, L. Kaplan, D. Roweth / Cray Inc. Cray XC Series Network,

    http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf

    [14] K-computer: http://en.wikipedia.org/wiki/K_computer

    [15] T. Inoue. The 6D Mesh/Torus Interconnect of K Computer, presentation

    [16] T. Maruyama. SPARC64™ VIIIfx: Fujitsu's New Generation Octo Core Processor for PETA Scale Computing, presentation, 2009

    [17] M. Sato. The K computer and Xcalable MP parallel language project, presentation

    [18] Y. Ajima, T. Inoue, S. Hiramoto, T. Shimizu. Tofu: Interconnect for the K computer, Fujitsu Sci.

    Tech. J., Vol 48, No. 3, pp. 280-285 (July 2012)

    [19] H. Maeda, H. Kubo, H. Shimamori, A. Tamura, J. Wei. System Packaging Technologies for the K

    Computer, Fujitsu Sci. Tech. J., Vol 48, No. 3, pp. 286-294 (July 2012)

    [20] Stampede: http://www.tacc.utexas.edu/resources/hpc/#stampede

    [21] Stampede: https://www.xsede.org/web/guest/tacc-stampede#overview

  • [22] D. Stanzione. The Stampede is Coming: A New Petascale Resource for the Open Science

    Community, presentation

    [23] S. Lantz. Parallel Programming on Ranger and Stampede, presentation, 2012

    [24] Tianhe 1A: http://www.nscc-tj.gov.cn/en/resources/resources_1.asp#TH-1A

    [25] X. Yang, X. Liao, K. Lu et al. The TianHe-1A supercomputer: Its hardware and software. Journal of Computer Science and Technology, 26(3): 344-351, May 2011