Top High Performance Computing Systems
A Short Overview
-
CRAY - JAGUAR
Jaguar was a petascale supercomputer built by Cray at Oak Ridge National Laboratory with a peak
performance of 1.75 PFlops. In 2012 the XT5 Jaguar was upgraded to Titan, a Cray XK7 system
currently ranked as the world's fastest computer. Key features of Jaguar follow, and then the
differences introduced by its upgrade, Titan.
System Specifications (most important):
Compute Node:
2 AMD Opteron 2435 hex-core processors
Cray-specific SeaStar network:
Improves scalability
Low-latency, high-bandwidth interconnect
Upgradable processor, memory and interconnect
Topology:
3D torus network
192 I/O nodes distributed across the torus to prevent hot-spots, connected to the storage via
InfiniBand
Fabric connections provide redundant paths
Processor: 224,256 AMD Opteron cores @ 2.6 GHz
Memory: 300 TB
Cabinets: 200 (8 rows of 25 cabinets)
Compute Nodes: 18,688
Performance: peak performance 1.75 PFlops
Size: 4,600 ft²
Power: 480 V power per cabinet
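The core count in the table follows directly from the node and socket figures above; a quick Python sanity check (not part of the original material):

```python
# Jaguar XT5: total cores = compute nodes x sockets/node x cores/socket
nodes = 18_688
sockets_per_node = 2      # two AMD Opteron 2435 processors per compute node
cores_per_socket = 6      # hex-core
assert nodes * sockets_per_node * cores_per_socket == 224_256
```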
-
Figure 1: XT5 Topology
Packaging Hierarchy:
Blade: 4 network connections
Cabinet (Rack): 96 nodes, 192 Opteron processors (1,152 Opteron cores)
System: 200 cabinets
Figure 2: Building the Cray XT5 System
-
Titan - Main Differences:
Jaguar was loaded up with 18,688 of the first shipping NVIDIA Tesla K20 GPUs and renamed
Titan
Compute node: consists of a 16-core AMD Opteron 6274 processor and a K20 GPU accelerator
Memory: 710 TB
Peak performance: more than 20 PFlops
Titan Interconnect:
1 Gemini routing and communications ASIC per two nodes
48 switch ports per Gemini chip (160 GB/s internal switching capacity per chip)
3D torus interconnect
Gemini Interconnect:
Enables tens of millions of MPI messages per second
Each hybrid node is interfaced to the Gemini interconnect through HyperTransport™ 3.0 technology
This architecture bypasses PCI bottlenecks and provides a peak of 20 GB/s of injection bandwidth per node
Connectionless protocol
Highly scalable architecture
Powerful bisection and global bandwidth
Dynamic routing of messages
Figure 3: Gemini Interconnect
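Both SeaStar and Gemini use a 3D torus, in which every node has six neighbors and a message can take the shorter way around each ring. A minimal illustrative sketch in Python (the coordinates and dimensions below are hypothetical, chosen only to show the wraparound behavior):

```python
def torus_neighbors(node, dims):
    """The six neighbors of a node in a 3D torus with wraparound links."""
    (x, y, z), (X, Y, Z) = node, dims
    return [((x + 1) % X, y, z), ((x - 1) % X, y, z),
            (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
            (x, y, (z + 1) % Z), (x, y, (z - 1) % Z)]

def torus_hops(a, b, dims):
    """Minimal hop count between two nodes, using wraparound where shorter."""
    return sum(min((ai - bi) % d, (bi - ai) % d) for ai, bi, d in zip(a, b, dims))

# Wraparound makes opposite edges adjacent: one hop instead of 24
print(torus_hops((0, 0, 0), (24, 0, 0), (25, 16, 24)))  # 1
```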
-
IBM - Blue Gene/Q
Blue Gene/Q is the third generation of IBM's Blue Gene series of HPC systems, following the Blue
Gene/L and /P architectures. It is a massively parallel computer whose purpose is to solve large-scale
scientific problems, such as protein folding.
Key Features:
Small footprint
High power efficiency (world's most efficient supercomputer - Green500, June 2011)
Low-latency, high-bandwidth inter-processor communication system
1,024 nodes/rack
Scales to 512 racks (currently implemented with 256); peak performance: 100 PFlops
Integrated 5D torus with optical cables for inter-processor communication; tremendous
bisection bandwidth
System Specifications (most important):
Processor: IBM PowerPC A2 @ 1.6 GHz, 16 cores per node
Memory: 16 GB SDRAM-DDR3 per node (1333 MT/s)
Networks: 5D torus @ 40 GB/s, 2.5 µs latency; collective network part of the 5D torus (collective logic operations supported); global barrier/interrupt part of the 5D torus; PCIe x8 Gen2-based I/O; 1 GbE control network (system boot, debug, monitoring)
I/O Nodes: 16-way SMP processor (10 GbE or InfiniBand); configurable as 8, 16 or 32 I/O nodes per rack
Performance: peak performance 209.7 TFlops per rack
Power: typically 80 kW per rack (estimated); 380-415 or 480 VAC, three-phase; maximum 100 kW per rack; 4 x 60 A service per rack
Dimensions: height 2095 mm, width 1219 mm, depth 1321 mm; weight 4500 lbs with coolant (LLNL 1 I/O drawer configuration); service clearances 914 mm on all sides
-
Compute Chip:
System-on-a-Chip Design: integrates processors, memory and networking logic into a single chip
- 360 mm2 Cu-45 technology (SOI)
- 16 user + 1 service processors
- plus 1 redundant processor
- all processors are symmetric
- 1.6 GHz
- L1 I/D cache = 16kB/16kB
- L1 prefetch engines
- peak performance 204.8 GFlops @ 55W
- Central shared L2 cache: 32 MB
- Dual memory controller
- Chip-to-chip networking
- Router logic integrated into BQC chip
- External IO
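The 204.8 GFlops peak figure is consistent with 16 user cores at 1.6 GHz if each core retires 8 double-precision flops per cycle (the assumption here: 4-wide SIMD with fused multiply-add); a quick check:

```python
# Blue Gene/Q compute chip peak: cores x clock x flops/cycle
cores = 16                # user cores (service/redundant cores excluded)
clock_ghz = 1.6
flops_per_cycle = 8       # assumed: 4-wide DP SIMD with fused multiply-add
peak_gflops = cores * clock_ghz * flops_per_cycle
assert abs(peak_gflops - 204.8) < 1e-9
```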
Crossbar Switch:
Central connection structure between processors, L2 cache, networking logic and various low-bandwidth units
Frequency: 800 MHz (half-frequency clock grid)
3 separate switches:
Request traffic -- write bandwidth 12B/PUnit @ 800 MHz
Response traffic -- write bandwidth 12B/PUnit @ 800 MHz
Invalidate traffic
22 master ports
18 slave ports
Peak on-chip bisection bandwidth: 563 GB/s
Networking Logic:
Communication ports:
11 bidirectional chip-to-chip links @ 2GB/s
2 links can be used for PCIe Gen2 x8
On-chip networking logic
Implements 14-port router
Designed to support point-to-point, collective and barrier messages
-
Packaging Hierarchy:
Basic Topology - Interconnects:
Node Board: connects 32 compute cards into a 5D torus of length 2. Each card consists of one
compute integrated circuit and 72 memory chips. Signaling rate @ 4 Gbps.
Boards on the same midplane are connected via the midplane connector.
Boards on different midplanes are connected via 8 link chips on each midplane;
these link chips drive optical transceivers (@ 10 Gbps) through optical cables.
Midplane: the midplane interconnection is a 512-node 5D torus of size 4 x 4 x 4 x 4 x 2.
Rack: contains one or two fully populated midplanes.
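The node counts at each packaging level fall out of the torus dimensions; a short Python check (the node-board and midplane shapes come from the text, while which dimension doubles when a rack combines two midplanes is an assumption made here for illustration):

```python
from math import prod

node_board = (2, 2, 2, 2, 2)   # 32 compute cards in a 5D torus of length 2
midplane   = (4, 4, 4, 4, 2)   # 512-node 5D torus, per the text
rack       = (4, 4, 4, 8, 2)   # two midplanes; assumed to double one dimension

for name, dims in [("node board", node_board), ("midplane", midplane), ("rack", rack)]:
    print(name, "x".join(map(str, dims)), "=", prod(dims), "nodes")
```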
-
Figure 4: Compute torus dimensionality of Blue Gene/Q building blocks
CRAY - XC30 Series
The Cray XC30 supercomputer is Cray's next-generation supercomputing series, upgradable to 100
PFlops per system. Known as the "Cascade" program, it was made possible in part by Cray's participation
in the Defense Advanced Research Projects Agency's (DARPA) High Productivity Computing Systems
program. One of the key features of this system is the introduction of the Aries™ interconnect, an
interconnect chipset follow-on to Gemini with a new system interconnect topology called "Dragonfly".
This topology leads to scalable system size and global network bandwidth.
System Specifications (most important):
Processor: 64-bit Intel Xeon E5-2600 series processors; up to 384 per cabinet
Memory: 32-128 GB per node; memory bandwidth up to 117 GB/s per node
Compute Cabinet: initially up to 3,072 processor cores per system cabinet, upgradeable
Interconnect: 1 Aries routing and communications ASIC per four compute nodes; 48 switch ports per Aries chip (500 GB/s switching capacity per chip); Dragonfly interconnect: low-latency, high-bandwidth topology
Performance: peak performance initially up to 66 TFlops per system cabinet
Power: 88 kW per compute cabinet at maximum configuration; circuit requirements (2 per compute cabinet): 100 AMP at 480/277 VAC or 125 AMP at 400/230 VAC (three-phase, neutral and ground); 6 kW per blower cabinet, 20 AMP at 480 VAC or 16 AMP at 400 VAC (three-phase, ground)
Dimensions: compute cabinet H 80.25 in. x W 35.56 in. x D 62.00 in.; blower cabinet H 80.25 in. x W 18.00 in. x D 42.00 in.; 3,450 lbs per compute cabinet (liquid cooled, 243 lbs/ft² floor loading); 750 lbs per blower cabinet
-
Compute Node - Components:
A pair of Intel Xeon E5 processors (16 or more cores)
8 DDR3 memory channels
Memory capacity: 64 or 128 GB
Aries Device - Interconnect:
System-on-chip device comprising four Network Interface Controllers, a 48-port tiled router and a multiplexer
It provides connectivity for the four nodes on a blade
Aries devices are connected to each other via cables and backplanes
Figure 5: Aries Device Connecting 4 Nodes
-
Dragonfly Topology - Features and Benefits:
Each chassis in the system holds 16 four-node blades (64 nodes)
Each cabinet has three chassis (192 nodes)
The Dragonfly network is constructed from two-cabinet electrical groups with 384 nodes per group
All connections within a group are electrical @ 14 Gbps per lane
This topology provides good global bandwidth and low latency at lower cost compared to a fat tree
Figure 6: Chassis backplane provides all-to-all connectivity between blades. Chassis within the same group connect to each
other via electrical connectors. Groups connect together via optical cables.
Figure 7: 2-D all-to-all structure is used within a group
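The 384-node group size quoted above is just the product of the packaging figures; a minimal check in Python:

```python
# XC30 Dragonfly electrical group size, from the packaging numbers in the text
nodes_per_blade = 4
blades_per_chassis = 16
chassis_per_cabinet = 3
cabinets_per_group = 2

nodes_per_group = (nodes_per_blade * blades_per_chassis
                   * chassis_per_cabinet * cabinets_per_group)
assert nodes_per_group == 384
```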
-
FUJITSU - K COMPUTER
The K computer is a massively parallel computer system developed by Fujitsu and RIKEN as part of the
High Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education,
Culture, Sports, Science and Technology. It is designed to perform extremely complex scientific
calculations, and it was ranked the top-performing supercomputer in the world in 2011.
System Specifications (most important):
CPU (SPARC64 VIIIfx):
Cores/Node: 8 cores @ 2 GHz
Performance: 128 GFlops
Architecture: SPARC V9 + HPC extension
Cache: L1 (I/D) 32 KB / 32 KB; L2 6 MB
Power: 58 W (typ., 30°C)
Memory bandwidth: 64 GB/s
Node:
Configuration: 1 CPU/node
Memory capacity: 16 GB (2 GB/core)
System Board (SB): 4 nodes/SB
Rack: 24 SBs/rack
System: >80,000 nodes/system; peak performance >10 PFlops
Interconnect:
Topology: 6D mesh/torus
Performance: 5 GB/s for each link
No. of links: 10 links/node
Architecture: routing chip structure (no outside switch box)
Compute Chip (CPU):
Extended SPARC64™ VII architecture for HPC
High performance and low power consumption
Highly reliable design
1271 signal pins
-
Compute Node:
Single CPU and interconnect controller
10 links for inter-node connection @ 10 GB/s per link
Topology - Tofu Interconnect:
Logical 3D torus network
Physical topology: 6D mesh/torus
10 links/node: 6 for the 3D torus (coordinates x, y, z) and 4 redundant links for the 3D mesh/torus (coordinates a, b, c)
Bisection bandwidth: >30 TB/s
Figure 8: Topology
6-D Benefits:
High Scalability
Twelve times higher than the 3D torus
High Performance and Operability
Low hop-count (average hop count about 1/2 of 3D torus)
High Fault Tolerance
12 possible alternate paths to bypass faulty nodes
Redundant node can be assigned preserving the torus topology
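The hop-count claim can be illustrated by brute force on small tori: spreading the same node count over more, shorter dimensions lowers the average minimal distance. A sketch (the 512-node shapes below are illustrative, not the K computer's actual dimensions):

```python
from itertools import product

def avg_hops(dims):
    """Exact mean minimal hop count from one node to every other node of a torus."""
    def hops(a, b):
        return sum(min((ai - bi) % d, (bi - ai) % d) for ai, bi, d in zip(a, b, dims))
    nodes = list(product(*[range(d) for d in dims]))
    origin = nodes[0]  # a torus is vertex-transitive, so one source suffices
    return sum(hops(origin, n) for n in nodes[1:]) / (len(nodes) - 1)

# 512 nodes as a 3D torus vs. a 6D torus: the 6D arrangement needs fewer hops
print(avg_hops((8, 8, 8)))           # ~6.01
print(avg_hops((4, 4, 4, 2, 2, 2)))  # ~4.51
```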
-
Packaging Hierarchy:
Figure 9: K-Computer Packaging Hierarchy
DELL - STAMPEDE
Stampede was developed by Dell at the Texas Advanced Computing Center (TACC) as an NSF-funded project.
It is a new HPC system and one of the largest computing systems in the world for open science research.
It was ranked 7th in the Top500 list in 2012.
System Specifications (most important):
Processor: two Intel 8-core Xeon E5 processors per node @ 2.7 GHz; one Intel 61-core Xeon Phi coprocessor per node @ 1.1 GHz
Memory: 32 GB per node (host memory), plus an additional 8 GB on the coprocessor card
Packaging: 6,400 nodes in 160 racks (40 nodes/rack), along with two 36-port Mellanox leaf switches per rack
Interconnect: nodes interconnected with Mellanox FDR InfiniBand technology in a 2-level (core and leaf) fat-tree topology @ 56 Gbps
Performance: peak performance 10 PFlops
Power: >6 MW total power
-
Compute Node:
Figure 10: Stampede Zeus Compute Node
Interconnect:
It consists of Mellanox switches, fiber cables and Host Channel Adapters (approximately 75 miles of cables)
Eight core 648-port SX6536 switches and more than 320 36-port SX6025 endpoint switches form a 2-level Clos fat-tree topology
They have switching capacities of 73 TB/s and 4.0 TB/s respectively
5/4 oversubscription at the endpoint (leaf) switches
Key feature: any MPI message travels at most 5 hops from source to destination
Figure 11: Stampede Interconnect
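The 5/4 oversubscription figure is consistent with the packaging numbers: 40 nodes per rack shared across two 36-port leaf switches leaves 20 node-facing ports and 16 uplinks per leaf. The exact port split is inferred here, not stated in the text:

```python
from fractions import Fraction

ports_per_leaf = 36      # Mellanox SX6025
nodes_per_rack = 40
leaves_per_rack = 2

down_links = nodes_per_rack // leaves_per_rack   # 20 node-facing ports per leaf
up_links = ports_per_leaf - down_links           # 16 uplinks toward the core switches
print(Fraction(down_links, up_links))            # 5/4
```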
NUDT - TIANHE-1A
Developed by the Chinese National University of Defense Technology, it was the fastest computer in the
world from October 2010 to June 2011, and now it is one of the few petascale supercomputers in the world.
-
System Specifications (most important):
Processor: two 6-core Intel Xeon X5670 CPUs @ 2.93 GHz and one NVIDIA M2050 GPU @ 1.15 GHz per compute node; two 8-core FT-1000 CPUs @ 1.0 GHz per service node
Memory: 262 TB total
Nodes/Packaging: 7,168 compute nodes and 1,024 service nodes installed in 120 racks
Interconnect: fat-tree structure with bi-directional bandwidth @ 160 Gbps
Performance: peak performance 4,700 TFlops
Power: 4.04 MW
Interconnect:
Use of high-radix Network Routing Chips (NRC) and high-speed Network Interface Chips (NIC)
The NRC and NIC chips are of in-house design (self-developed intellectual property)
The topology is an optic-electronic hybrid hierarchical fat-tree structure
Bi-directional bandwidth @ 160 Gbps; latency 1.57 µs; switch density 61.44 Tbps in a single backboard
Figure 12: Architecture of the Interconnect Network
-
The first layer consists of 480 switching boards. In each rack, every 16 nodes connect with each
other through the switch board in the rack. The main boards are connected to the switch
board via a back board. Communication on switching boards uses electrical transmission.
The second layer contains 11 384-port switches connected with QSFP optical fibers. There are
12 leaf switch boards and 12 root switch boards in each 384-port switch; a high-density back
board connects them orthogonally. The switch boards in the racks are connected to the 384-port
switches through optical fibers.
References:
[1] Top 500 supercomputer sites: http://www.top500.org/
[2] A. Bland, R. Kendall, D. Kothe, J. Rogers, G. Shipman. Jaguar: The World's Most Powerful Computer
[3] B. Bland. The World's Most Powerful Computer, presentation at Cray User Group 2009 Meeting
[4] Jaguar (supercomputer): http://en.wikipedia.org/wiki/Jaguar_(supercomputer)
[5] Titan (supercomputer): http://en.wikipedia.org/wiki/Titan_(supercomputer)
[6] Cray XK7: http://www.cray.com/Products/Computing/XK7.aspx
[7] Blue Gene/Q: http://en.wikipedia.org/wiki/Blue_Gene
[8] IBM. IBM System Blue Gene/Q, IBM Systems and Technology Data Sheet
[9] J. Milano, P. Lembke. IBM System Blue Gene Solution, Blue Gene/Q Hardware Overview and
Installation Planning, Redbooks
[10] M. Ohmacht/ IBM BlueGene Team. Memory Speculation of the Blue Gene/Q Compute Chip,
presentation, 2011
[11] R. Wisniewski/ IBM BlueGene Team. BlueGene/Q: Architecture CoDesign; Path to Exascale,
presentation, 2012
[12] Cray XC30: http://www.cray.com/Products/Computing/XC.aspx
[13] B. Alverson, E. Froese, L. Kaplan, D. Roweth / Cray Inc. Cray XC Series Network,
http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf
[14] K-computer: http://en.wikipedia.org/wiki/K_computer
[15] T. Inoue. The 6D Mesh/Torus Interconnect of K Computer, presentation
[16] T. Maruyama. SPARC64™ VIIIfx: Fujitsu's New Generation Octo-Core Processor for Peta-Scale Computing, presentation, 2009
[17] M. Sato. The K computer and Xcalable MP parallel language project, presentation
[18] Y. Ajima, T. Inoue, S. Hiramoto, T. Shimizu. Tofu: Interconnect for the K computer, Fujitsu Sci.
Tech. J., Vol 48, No. 3, pp. 280-285 (July 2012)
[19] H. Maeda, H. Kubo, H. Shimamori, A. Tamura, J. Wei. System Packaging Technologies for the K
Computer, Fujitsu Sci. Tech. J., Vol 48, No. 3, pp. 286-294 (July 2012)
[20] Stampede: http://www.tacc.utexas.edu/resources/hpc/#stampede
[21] Stampede: https://www.xsede.org/web/guest/tacc-stampede#overview
-
[22] D. Stanzione. The Stampede is Coming: A New Petascale Resource for the Open Science
Community, presentation
[23] S. Lantz. Parallel Programming on Ranger and Stampede, presentation, 2012
[24] Tianhe 1A: http://www.nscc-tj.gov.cn/en/resources/resources_1.asp#TH-1A
[25] X. Yang, X. Liao, K. Lu et al. The TianHe-1A Supercomputer: Its Hardware and Software, Journal of Computer Science and Technology, 26(3): 344-351, May 2011