HPC Overview


Overview of High Performance Computing (HPC) systems


  • Top High Performance Computing Systems

    A Short Overview

  • CRAY - JAGUAR

    Jaguar was a petascale supercomputer built by Cray at Oak Ridge National Laboratory with a peak performance of 1.75 PFlops. In 2012 the XT5 Jaguar was replaced by Titan, a Cray XK7 system that is currently ranked as the world's fastest computer. The key features of Jaguar are described first, followed by the differences introduced by its upgrade, Titan.

    System Specifications (most important):

    Compute Node:

    2 AMD Opteron 2435 hex-core processors

    Cray-specific SeaStar network:
    Improves scalability
    Low latency, high bandwidth interconnect
    Upgradable processor, memory and interconnect

    Topology:
    3D torus network (an addressing sketch follows the specification table below)
    192 I/O nodes distributed across the torus to prevent hot spots, connected to the storage via InfiniBand
    Fabric connections provide redundant paths

    Processor: 224,256 AMD Opteron cores @ 2.6 GHz
    Memory: 300 TB
    Cabinets: 200 (8 rows of 25 cabinets)
    Compute Nodes: 18,688
    Performance: peak performance 1.75 PFlops
    Size: 4,600 square feet
    Power: 480-volt power per cabinet
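
    As a rough illustration of the 3D torus addressing mentioned above (a minimal sketch with hypothetical dimensions, not Jaguar's actual cabling or routing), the Python fragment below lists the six neighbors of a node; coordinates wrap around in every dimension, so there are no edge nodes:

        # Minimal sketch: neighbor addressing in a 3D torus.
        # DIMS is hypothetical and chosen only for illustration.
        DIMS = (16, 16, 16)

        def torus_neighbors(x, y, z, dims=DIMS):
            nx, ny, nz = dims
            return [
                ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
                (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
                (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
            ]

        print(torus_neighbors(0, 0, 0))   # wrap-around links reach the far faces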

  • Figure 1: XT5 Topology

    Packaging Hierarchy:

    Blade: 4 network connections

    Cabinet (Rack): 192 Opteron processors, 776 Opteron cores, 96 nodes

    System: 200 cabinets

    Figure 2: Building the Cray XT5 System

  • Titan - Main Differences:

    Jaguar was fitted with 18,688 of the first shipping NVIDIA Tesla K20 GPUs and renamed Titan

    Compute node: Consists of a 16-core AMD Opteron 6274 processor and a K20 GPU accelerator

    Memory: 710 TB

    Peak Performance: more than 20 PFlops

    Titan Interconnect:

    1 Gemini routing and communications ASIC per two nodes

    48 switch ports per Gemini chip (160 GB/s internal switching capacity per chip)

    3D torus interconnect

    Gemini Interconnect:

    Enables tens of millions of MPI messages per second (a minimal MPI example follows Figure 3)

    Each hybrid node is interfaced to the Gemini interconnect through HyperTransport™ 3.0 technology
    This architecture bypasses PCI bottlenecks and provides a peak of 20 GB/s of injection bandwidth per node
    Connectionless protocol
    Highly scalable architecture

    Powerful bisection and global bandwidth

    Dynamic routing of messages

    Figure 3: Gemini Interconnect
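
    For reference, the MPI messages whose rate Gemini is designed to sustain look like the generic mpi4py ping below; nothing here is Gemini-specific, and the script name is only an example:

        # Generic MPI point-to-point ping (mpi4py); run e.g. with: mpiexec -n 2 python ping.py
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        if rank == 0:
            comm.send(b"ping", dest=1, tag=0)     # small message out
            reply = comm.recv(source=1, tag=1)    # wait for the echo
            print("rank 0 received", reply)
        elif rank == 1:
            msg = comm.recv(source=0, tag=0)
            comm.send(b"pong", dest=0, tag=1)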

  • IBM - Blue Gene/Q

    Blue Gene/Q is the third generation of IBM's series of HPC systems, Blue Gene, following the Blue Gene/L and /P architectures. It is a massively parallel computer designed to solve large-scale scientific problems such as protein folding.

    Key Features:

    Small footprint

    High power efficiency (world's most efficient supercomputer, Green500, June 2011)
    Low latency, high bandwidth inter-processor communication system
    1,024 nodes/rack
    Scales to 512 racks (currently implemented with 256); peak performance 100 PFlops
    Integrated 5D torus with optical cables for inter-processor communication
    Tremendous bisection bandwidth

    System Specifications (most important):

    Processor: IBM PowerPC A2, 1.6 GHz, 16 cores per node
    Memory: 16 GB SDRAM-DDR3 per node (1333 MT/s)
    Networks: 5D torus @ 40 GB/s, 2.5 μs latency; collective network part of the 5D torus (collective logic operations supported); global barrier/interrupt part of the 5D torus; PCIe x8 Gen2 based I/O; 1 GB control network for system boot, debug and monitoring
    I/O Nodes: 16-way SMP processor (10 GbE or InfiniBand); configurable as 8, 16 or 32 I/O nodes per rack
    Performance: peak performance per rack 209.7 TFlops
    Power: typical 80 kW per rack (estimated); 380-415 or 480 VAC three-phase; maximum 100 kW per rack; 4 x 60 A service per rack
    Dimensions: height 2095 mm, width 1219 mm, depth 1321 mm; weight 4500 lbs with coolant (LLNL 1 I/O drawer configuration); service clearances 914 mm on all sides
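
    A quick consistency check on the rack-level peak above, assuming 8 flops per cycle per A2 core (a 4-wide FMA unit; this assumption is not stated in the table itself):

        # Back-of-the-envelope check of the Blue Gene/Q figures above.
        cores_per_node  = 16
        clock_ghz       = 1.6
        flops_per_cycle = 8        # assumed: 4-wide FMA per core
        nodes_per_rack  = 1024

        node_peak_gflops = cores_per_node * clock_ghz * flops_per_cycle  # 204.8 GFlops
        rack_peak_tflops = node_peak_gflops * nodes_per_rack / 1000      # ~209.7 TFlops
        print(node_peak_gflops, rack_peak_tflops)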

  • Compute Chip:

    System-on-a-Chip Design: integrates processors, memory and networking logic into a single chip

    - 360 mm² Cu-45 technology (SOI)
    - 16 user + 1 service processors
    - plus 1 redundant processor
    - all processors are symmetric
    - 1.6 GHz
    - L1 I/D cache: 16 KB / 16 KB
    - L1 prefetch engines
    - peak performance 204.8 GFlops @ 55 W

    - Central shared L2 cache: 32 MB

    - Dual memory controller

    - Chip-to-chip networking

    - Router logic integrated into BQC chip

    - External IO

    Crossbar Switch:

    Central connection structure between processors, L2 cache, networking logic and various low-

    bandwidth units

    Frequency: 800 MHz (half-frequency clock grid)

    3 separate switches:

    Request traffic -- write bandwidth 12B/PUnit @ 800 MHz

    Response traffic -- write bandwidth 12B/PUnit @ 800 MHz

    Invalidate traffic

    22 master ports

    18 slave ports

    Peak on-chip bisection bandwidth 563 GB/s

    Networking Logic:

    Communication ports:

    11 bidirectional chip-to-chip links @ 2GB/s

    2 links can be used for PCIe Gen2 x8

    On-chip networking logic

    Implements 14-port router

    Designed to support point-to-point, collective and barrier messages

  • Packaging Hierarchy:

    Basic Topology - Interconnect:

    Node Board:
    Connects 32 compute cards into a 5-D torus of length 2. Each card consists of one compute integrated circuit and 72 memory chips
    Signaling rate @ 4 Gbps
    Boards on the same midplane are connected via the midplane connector
    Boards on different midplanes are connected via 8 link chips on each midplane
    These link chips drive optical transceivers (@ 10 Gbps) through optical cables

    Midplane:
    Midplane interconnection is a 512-node, 5-D torus of size 4 x 4 x 4 x 4 x 2

    Rack:
    Contains one or two fully populated midplanes
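
    The torus sizes above multiply out to the node counts quoted earlier (a simple check, no new data):

        # 5-D torus of size 4 x 4 x 4 x 4 x 2 per midplane; two midplanes per rack.
        from math import prod

        nodes_per_midplane = prod((4, 4, 4, 4, 2))   # 512
        nodes_per_rack     = 2 * nodes_per_midplane  # 1,024 nodes/rack, as listed above
        print(nodes_per_midplane, nodes_per_rack)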

  • Figure 4: Compute torus dimensionality of Blue Gene/Q building blocks

    CRAY - XC30 Series

    The Cray XC30 supercomputer is Cray's next-generation supercomputing series, upgradable to 100 PFlops per system. Known as the "Cascade" program, it was made possible in part by Cray's participation in the Defense Advanced Research Projects Agency's (DARPA) High Productivity Computing Systems program. One of the key features of this system is the introduction of the Aries™ interconnect, an interconnect chipset follow-on to Gemini, with a new system interconnect topology called "Dragonfly". This topology provides scalable system size and global network bandwidth.

    System Specifications (most important):

    Processor: 64-bit Intel Xeon E5-2600 Series processors; up to 384 per cabinet
    Memory: 32-128 GB per node; memory bandwidth up to 117 GB/s per node
    Compute Cabinet: initially up to 3,072 processor cores per system cabinet, upgradeable
    Interconnect: 1 Aries routing and communications ASIC per four compute nodes; 48 switch ports per Aries chip (500 GB/s switching capacity per chip); Dragonfly interconnect with low latency, high bandwidth topology
    Performance: peak performance initially up to 66 TFlops per system cabinet
    Power: 88 kW per compute cabinet, maximum configuration; circuit requirements (2 per compute cabinet): 100 A at 480/277 VAC or 125 A at 400/230 VAC (three-phase, neutral and ground); 6 kW per blower cabinet, 20 A at 480 VAC or 16 A at 400 VAC (three-phase, ground)

  • Compute Node - Components:

    A pair of Intel Xeon E5 processors (16 or more cores)

    8 DDR-3 memory channels

    Memory capacity: 64 or 128 GB

    Aries Device - Interconnects:

    System-on-chip device comprising four Network Interface Controllers, a 48-port tiled router and

    a multiplexer

    It provides connectivity for four nodes on a blade

    Aries devices are connected to each other via cables and backplanes

    Figure 5: Aries Device Connecting 4 Nodes

    Dimensions: H 80.25 in. x W 35.56 in. x D 62.00 in. (compute cabinet); H 80.25 in. x W 18.00 in. x D 42.00 in. (blower cabinet); 3,450 lbs per compute cabinet (liquid cooled), 243 lbs/sq ft floor loading; 750 lbs per blower cabinet
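
    The cabinet-level figures above tie together as follows; the 192 nodes per cabinet come from the Dragonfly packaging described in the next section (3 chassis x 16 blades x 4 nodes), and the rest is simple arithmetic:

        # Cross-checking the XC30 cabinet figures (node count taken from the Dragonfly section).
        nodes_per_cabinet = 192    # 3 chassis x 16 blades x 4 nodes
        sockets_per_node  = 2      # "a pair of Intel Xeon E5 processors"
        cores_per_socket  = 8      # initial E5-2600 parts
        nodes_per_aries   = 4      # 1 Aries ASIC per four compute nodes

        print(nodes_per_cabinet * sockets_per_node)                     # 384 processors/cabinet
        print(nodes_per_cabinet * sockets_per_node * cores_per_socket)  # 3,072 cores/cabinet
        print(nodes_per_cabinet // nodes_per_aries)                     # 48 Aries devices/cabinet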

  • Dragonfly Topology - Features and Benefits:

    Each chassis in the system consists of 16 four-node blades (64 nodes per chassis; see the group-size check after Figure 7)

    Each cabinet has three chassis (192 nodes)

    Dragonfly network is constructed from two-cabinet electrical groups with 384 nodes/group

    All connections within a group are electrical @14Gbps per lane

    This topology provides good global bandwidth and low latency at lower cost compared to a fat

    tree

    Figure 6: Chassis backplane provides all-to-all connectivity between blades. Chassis within the same group connect to each other via electrical connectors. Groups connect together via optical cables.

    Figure 7: 2-D all-to-all structure is used within a group
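
    The 384-node group size quoted above follows directly from the packaging (a check, with no new data):

        # Nodes per Dragonfly electrical group, from the packaging described above.
        nodes_per_blade     = 4
        blades_per_chassis  = 16
        chassis_per_cabinet = 3
        cabinets_per_group  = 2

        nodes_per_chassis = nodes_per_blade * blades_per_chassis        # 64
        nodes_per_group   = nodes_per_chassis * chassis_per_cabinet * cabinets_per_group
        print(nodes_per_chassis, nodes_per_group)                       # 64, 384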

  • FUJITSU - K COMPUTER

    The K computer is a massively parallel computer system developed by Fujitsu and RIKEN as part of the High Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology. It is designed to perform extremely complex scientific calculations and was ranked as the top-performing supercomputer in the world in 2011.

    System Specifications (most important):

    CPU (SPARC64 VIIIfx):
    Cores/Node: 8 cores @ 2 GHz
    Performance: 128 GFlops
    Architecture: SPARC V9 + HPC extensions
    Cache: L1 (I/D) 32 KB / 32 KB; L2 6 MB
    Power: 58 W (typical at 30 °C)
    Memory Bandwidth: 64 GB/s

    Node:
    Configuration: 1 CPU/node
    Memory Capacity: 16 GB (2 GB/core)

    System Board (SB): 4 nodes/SB
    Rack: 24 SBs/rack
    System: >80,000 nodes/system; peak performance >10 PFlops

    Interconnect:
    Topology: 6D mesh/torus
    Performance: 5 GB/s per link
    No. of Links: 10 links/node
    Architecture: routing chip structure (no external switch box)

    Compute Chip (CPU):

    Extended SPARC64™ VII architecture for HPC
    High performance and low power consumption
    Highly reliable design
    1,271 signal pins
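
    The 128 GFlops per CPU quoted in the table above is consistent with 8 flops per cycle per core; the per-core FMA width is an assumption here, not something the table states. A minimal check in Python:

        # Consistency check for the SPARC64 VIIIfx peak (flops/cycle is assumed).
        cores = 8
        clock_ghz = 2.0
        flops_per_cycle = 8    # assumed: SIMD FMA pipelines delivering 8 flops/cycle/core
        print(cores * clock_ghz * flops_per_cycle)   # 128.0 GFlops, matching the table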

  • Compute Node:

    Single CPU and an interconnect controller
    10 links for inter-node connection @ 10 GB/s per link

    Topology - Tofu Interconnect:

    Logical 3-D torus network

    Physical topology: 6-D Mesh/Torus

    10 links/node: 6 for the 3D torus (coordinates x, y, z) and 4 redundant links for the 3D mesh/torus (coordinates a, b, c); a link-count sketch follows the 6-D benefits list below

    Bi-section bandwidth: >30TB/s

    Figure 8: Topology

    6-D Benefits:

    High Scalability

    Twelve times higher than the 3D torus

    High Performance and Operability

    Low hop-count (average hop count about 1/2 of 3D torus)

    High Fault Tolerance

    12 possible alternate paths to bypass faulty nodes

    Redundant node can be assigned preserving the torus topology
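
    The 10 links per node follow from the dimension sizes if the redundant a, b, c dimensions are assumed to have sizes 2, 3 and 2 (an assumption; the text above only gives the link and dimension counts). A small Python sketch:

        # Counting inter-node links in a Tofu-style 6-D mesh/torus.
        # Assumption: x, y, z are large torus rings (2 links each);
        # a, b, c have sizes 2, 3, 2 (a size-2 ring needs only one link).
        def links_per_dimension(size):
            return 1 if size == 2 else 2

        xyz_links = 3 * 2                                            # 6 links
        abc_links = sum(links_per_dimension(s) for s in (2, 3, 2))   # 1 + 2 + 1 = 4
        print(xyz_links + abc_links)                                 # 10 links/node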

  • Packaging Hierarchy:

    Figure 9: K-Computer Packaging Hierarchy

    DELL - STAMPEDE

    Stampede was developed by Dell at the Texas Advanced Computing Center (TACC) as an NSF-funded project. It is a new HPC system and one of the largest computing systems in the world for open science research. It was ranked 7th on the Top500 list in 2012.

    System Specifications (most important):

    Processor: two Intel 8-core Xeon E5 processors per node @ 2.7 GHz; one Intel 61-core Xeon Phi coprocessor per node @ 1.1 GHz
    Memory: 32 GB per node (host memory), plus an additional 8 GB on the coprocessor card
    Packaging: 6,400 nodes in 160 racks (40 nodes/rack), along with two 36-port Mellanox leaf switches per rack
    Interconnect: nodes interconnected with Mellanox FDR InfiniBand in a 2-level (core and leaf) fat-tree topology @ 56 Gb/s
    Performance: peak performance 10 PFlops
    Power: >6 MW total power

  • Compute Node:

    Figure 10: Stampede Zeus Compute Node

    Interconnect:

    It consists of Mellanox switches, fiber cables and Host Channel Adapters, with approximately 75 miles of cables
    Eight core 648-port SX6536 switches and more than 320 36-port SX6025 endpoint switches form a 2-level Clos fat-tree topology
    They have switching capacities of roughly 73 and 4.0 Tb/s respectively
    5/4 oversubscription at the endpoint (leaf) switches; see the port-count sketch after Figure 11
    Key feature: any MPI message is at most 5 hops from source to destination

    Figure 11: Stampede Interconnect
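
    The 5/4 oversubscription above can be reproduced from the port counts; the 20/16 split of each leaf switch into host-facing and uplink ports is inferred from 40 nodes per rack and two leaf switches per rack, not stated explicitly. A short Python check:

        # Leaf-switch oversubscription on Stampede (port split inferred, see note above).
        from fractions import Fraction

        ports_per_leaf  = 36
        nodes_per_rack  = 40
        leaves_per_rack = 2

        down_ports = nodes_per_rack // leaves_per_rack   # 20 host links per leaf (inferred)
        up_ports   = ports_per_leaf - down_ports         # 16 uplinks toward the core switches
        print(Fraction(down_ports, up_ports))            # 5/4 oversubscription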

    NUDT - TIANHE-I

    Developed by the Chinese National University of Defense Technology, it was the fastest computer in the world from October 2010 to June 2011, and it is now one of the few petascale supercomputers in the world.

  • System Specifications (most important):

    Interconnect:

    Use of high-radix Network Routing Chips (NRC) and high-speed Network Interface Chips (NIC)
    The NRC and NIC chips are proprietary, in-house designs
    The topology is an optic-electronic hybrid, hierarchical fat-tree structure
    Bi-directional bandwidth @ 160 Gbps; latency 1.57 μs; switch density 61.44 Tbps in a single backboard

    Figure 12: Architecture of the Interconnect Network

    Processor: two 6-core Intel Xeon X5670 CPUs @ 2.93 GHz and one NVIDIA M2050 GPU @ 1.15 GHz per compute node; two 8-core FT-1000 CPUs @ 1.0 GHz per service node
    Memory: 262 TB total
    Nodes/Packaging: 7,168 compute nodes and 1,024 service nodes installed in 120 racks
    Interconnect: fat-tree structure with bi-directional bandwidth @ 160 Gbps
    Performance: peak performance 4,700 TFlops
    Power: 4.04 MW
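
    The 4,700 TFlops peak above is roughly reproduced by the per-node parts, assuming 4 double-precision flops per cycle per Xeon core and a 515 GFlops double-precision peak for the M2050; both figures are assumptions, not taken from the table. A rough Python check:

        # Rough reconstruction of the Tianhe-1A peak from its compute-node parts.
        # Assumptions: 4 DP flops/cycle per Westmere core, 515 GFlops DP per M2050 GPU.
        compute_nodes = 7168
        xeon_gflops   = 2 * 6 * 2.93 * 4    # two 6-core X5670s @ 2.93 GHz
        gpu_gflops    = 515                 # assumed M2050 double-precision peak
        node_gflops   = xeon_gflops + gpu_gflops
        print(round(compute_nodes * node_gflops / 1000))   # ~4700 TFlops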

  • The first layer consists of 480 switching boards. In each rack, every 16 nodes connect with each other through the switching board in the rack. The main boards are connected to the switching board via a backplane board. Communication on the switching boards uses electrical transmission.

    The second layer contains 11 384-port switches connected with QSFP optical fibers. Each 384-port switch contains 12 leaf switch boards and 12 root switch boards, connected orthogonally by a high-density backplane. The switching boards in the racks are connected to the 384-port switches through optical fibers.

    References:

    [1] Top 500 supercomputer sites: http://www.top500.org/

    [2] A. Bland, R. Kendall, D. Kothe, J. Rogers, G. Shipman. Jaguar: The World's Most Powerful Computer

    [3] B. Bland. The World's Most Powerful Computer, presentation at the Cray User Group 2009 Meeting

    [4] Jaguar (supercomputer): http://en.wikipedia.org/wiki/Jaguar_(supercomputer)

    [5] Titan (supercomputer): http://en.wikipedia.org/wiki/Titan_(supercomputer)

    [6] Cray XK7: http://www.cray.com/Products/Computing/XK7.aspx

    [7] Blue Gene/Q: http://en.wikipedia.org/wiki/Blue_Gene

    [8] IBM. IBM System Blue Gene/Q, IBM Systems and Technology Data Sheet

    [9] J. Milano, P. Lembke. IBM System Blue Gene Solution, Blue Gene/Q Hardware Overview and

    Installation Planning, Redbooks

    [10] M. Ohmacht/ IBM BlueGene Team. Memory Speculation of the Blue Gene/Q Compute Chip,

    presentation, 2011

    [11] R. Wisniewski/ IBM BlueGene Team. BlueGene/Q: Architecture CoDesign; Path to Exascale,

    presentation, 2012

    [12] Cray XC30: http://www.cray.com/Products/Computing/XC.aspx

    [13] B. Alverson, E. Froese, L. Kaplan, D. Roweth / Cray Inc. Cray XC Series Network,

    http://www.cray.com/Assets/PDF/products/xc/CrayXC30Networking.pdf

    [14] K-computer: http://en.wikipedia.org/wiki/K_computer

    [15] T. Inoue. The 6D Mesh/Torus Interconnect of K Computer, presentation

    [16] T. Maruyama. SPARC64™ VIIIfx: Fujitsu's New Generation Octo Core Processor for PETA Scale Computing, presentation, 2009

    [17] M. Sato. The K computer and Xcalable MP parallel language project, presentation

    [18] Y. Ajima, T. Inoue, S. Hiramoto, T. Shimizu. Tofu: Interconnect for the K computer, Fujitsu Sci.

    Tech. J., Vol 48, No. 3, pp. 280-285 (July 2012)

    [19] H. Maeda, H. Kubo, H. Shimamori, A. Tamura, J. Wei. System Packaging Technologies for the K

    Computer, Fujitsu Sci. Tech. J., Vol 48, No. 3, pp. 286-294 (July 2012)

    [20] Stampede: http://www.tacc.utexas.edu/resources/hpc/#stampede

    [21] Stampede: https://www.xsede.org/web/guest/tacc-stampede#overview

  • [22] D. Stanzione. The Stampede is Coming: A New Petascale Resource for the Open Science

    Community, presentation

    [23] S. Lantz. Parallel Programming on Ranger and Stampede, presentation, 2012

    [24] Tianhe 1A: http://www.nscc-tj.gov.cn/en/resources/resources_1.asp#TH-1A

    [25] X. Yang, X. Liao, K. Lu et al. The TianHe-1A supercomputer: Its hardware and software. Journal of Computer Science and Technology, 26(3): 344-351, May 2011