Case study
• IBM Blue Gene/L system
• InfiniBand
Interconnect family share in the Top 500 supercomputers, 06/2011 and 11/2012

06/2011:
• Gigabit Ethernet: 232
• InfiniBand: 206
• Proprietary Network: 29
• Custom interconnect: 23
• Myrinet: 4
• NUMAlink: 2

11/2012:
• InfiniBand: 224
• Gigabit Ethernet: 189
• Custom interconnect: 53
• Cray interconnect: 15
• Proprietary Network: 15
• Myrinet: 3
Overview of the IBM Blue Gene/L System Architecture
• Design objectives
• Hardware overview
– System architecture
– Node architecture
– Interconnect architecture
Highlights
• A 64K-node highly integrated supercomputer based on system-on-a-chip technology
– Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
• Distributed-memory, massively parallel processing (MPP) architecture
• Uses the message-passing programming model (MPI)
• 360 Tflops peak performance (see the arithmetic sketch below)
• Optimized for cost/performance
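The 360-Tflops figure follows directly from the node count and per-core throughput. A minimal sketch of the arithmetic, assuming the commonly cited BG/L figure of 4 flops/cycle per core (two fused multiply-adds on the enhanced dual FPU), which this slide does not state:

    /* Back-of-the-envelope peak performance for BG/L. */
    #include <stdio.h>

    int main(void) {
        double clock_hz        = 700e6;  /* 700 MHz core clock (slide)     */
        double flops_per_cycle = 4.0;    /* assumed: 2 FMAs/cycle per core */
        int    cores_per_node  = 2;
        long   nodes           = 65536;  /* 64K nodes                      */

        double peak = clock_hz * flops_per_cycle * cores_per_node * nodes;
        printf("peak = %.0f Tflops\n", peak / 1e12);  /* ~367, i.e. ~360 */
        return 0;
    }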
Design objectives
• Objective 1: a 360-Tflops supercomputer
– For comparison: the Earth Simulator (Japan, fastest supercomputer from 2002 to 2004) peaked at 35.86 Tflops
• Objective 2: power efficiency
– Performance/rack = performance/watt * watt/rack
• Watt/rack is roughly constant at around 20 kW
• Performance/watt therefore determines performance/rack
• Power efficiency (a worked example follows below):
– 360 Tflops would take about 20 megawatts with conventional processors
– Need a low-power processor design (2-10 times better power efficiency)
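A worked instance of the identity above, assuming 1024 compute nodes per rack (64 racks for 64K nodes); that per-rack node count is a commonly cited BG/L figure, not something stated on the slide:

    /* performance/rack = (performance/watt) * (watt/rack) */
    #include <stdio.h>

    int main(void) {
        double watts_per_rack = 20e3;    /* ~20 kW/rack (slide)       */
        int    racks          = 64;      /* assumed: 65536/1024 nodes */
        double peak_flops     = 360e12;  /* 360 Tflops (slide)        */

        double total_watts = watts_per_rack * racks;        /* ~1.28 MW         */
        double flops_per_w = peak_flops / total_watts;      /* ~281 Mflops/W    */
        double per_rack    = flops_per_w * watts_per_rack;  /* ~5.6 Tflops/rack */
        printf("%.2f MW, %.0f Mflops/W, %.1f Tflops/rack\n",
               total_watts / 1e6, flops_per_w / 1e6, per_rack / 1e12);
        return 0;
    }

Under these assumptions the full machine draws roughly 1.3 MW, about 15 times less than the ~20 MW a conventional-processor design would need for the same peak.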
Design objectives (continued)
• Objective 3: extreme scalability
– Optimizing for cost/performance means using low-power, less powerful processors, which in turn requires a large number of them
• Up to 65536 processors
– Interconnect scalability
Blue Gene/L system components
Blue Gene/L Compute ASIC
• Two PowerPC 440 cores with floating-point enhancements
– 700 MHz
– Everything of a typical superscalar processor
• Pipelined microarchitecture with dual instruction fetch, decode, out-of-order issue, out-of-order dispatch, out-of-order execution, out-of-order completion, etc.
– About 1 W each, through extensive power management
Memory system on a BG/L node
• BG/L supports only the distributed-memory paradigm.
• No need for efficient cache-coherence support on each node.
– Coherence is enforced by software if needed.
• The two cores operate in two modes:
– Communication coprocessor mode
• Needs coherence, managed in system-level libraries
– Virtual node mode
• Memory is physically partitioned (not shared).
Blue Gene/L networks
• Five networks.
– 100 Mbps Ethernet control network for diagnostics, debugging, and other services.
– 1000 Mbps Ethernet for I/O.
– Three high-bandwidth, low-latency networks for data transmission and synchronization:
• 3-D torus network for point-to-point communication
• Collective network for global operations
• Barrier network
• All network logic is integrated in the BG/L node ASIC.
– Memory-mapped interfaces from user space
3-D torus network
• Supports point-to-point communication
• Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
• 64x32x32 torus: diameter 32+16+16 = 64 hops; worst-case hardware latency 6.4 us (see the sketch below)
• Cut-through routing
• Adaptive routing
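The diameter figure comes from the fact that, with bidirectional links, the farthest node in a torus is half-way around each dimension. A small sketch; the per-hop latency is simply derived from the slide's own numbers (6.4 us / 64 hops), not an independent spec:

    #include <stdio.h>

    int main(void) {
        int dims[3] = {64, 32, 32};
        int diameter = 0;
        for (int i = 0; i < 3; i++)
            diameter += dims[i] / 2;           /* 32 + 16 + 16 = 64 hops */

        double ns_per_hop = 6400.0 / diameter; /* from 6.4 us worst case */
        printf("diameter = %d hops, ~%.0f ns/hop\n", diameter, ns_per_hop);
        return 0;
    }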
Collective network
• Binary-tree topology, static routing
• Link bandwidth: 2.8 Gb/s
• Maximum hardware latency: 5 us
• With arithmetic and logic hardware: can perform integer operations on the data
– Efficient support for reduce, scan, global sum, and broadcast operations
– Floating-point operations can be done in 2 passes.
Barrier network
• Hardware support for global synchronization.
• 1.5 us for a barrier across 64K nodes (an MPI sketch follows below).
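Since BG/L applications use MPI, these two networks are what the corresponding MPI collectives ride on. A minimal sketch of the operations involved; the mapping shown is the natural one implied by these slides, not a claim about IBM's exact MPI implementation:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, sum;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Global integer sum: the kind of reduce + broadcast the
         * collective network's integer hardware accelerates. */
        MPI_Allreduce(&rank, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Global synchronization: what the barrier network accelerates. */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of ranks = %d\n", sum);
        MPI_Finalize();
        return 0;
    }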
IBM Blue Gene/L summary
• Optimized for cost/performance, which limits the range of suitable applications.
– Uses a low-power design
• Lower frequency, system-on-a-chip
• Excellent performance-per-watt
• Scalability support
– Hardware support for global communication and barriers
– Low latency, high bandwidth
• Case 2: InfiniBand architecture
– Specification (InfiniBand architecture specification, release 1.2.1, January 2008/Oct. 2006) available from the InfiniBand Trade Association (http://www.infinibandta.org)
• InfiniBand architecture overview
– Components:
• Links
• Channel adapters
• Switches
• Routers
– The specification allows an InfiniBand wide area network, but it is mostly adopted as a system/storage area network.
– Topology:
• Irregular
• Regular: fat tree, hypercube, torus, etc.
– Link speed (worked rates sketched below):
• Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
• Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X)
• Quad data rate (QDR): 40 Gbps (4X)
• Fourteen data rate (FDR): 56 Gbps (4X)
• Layers: somewhat similar to TCP/IP
– Physical layer
– Link layer
• Error detection (CRC checksums)
• Flow control (credit-based)
• Switching, virtual lanes (VL)
• Forwarding tables computed by the subnet manager
– Single-path deterministic routing (not adaptive)
– Network layer: routing across subnets.
• Not used in the cluster environment
– Transport layer
• Reliable/unreliable, connection/datagram
– Verbs: the interface between adapters and the OS/users
• InfiniBand link-layer packet format (header sizes sketched below):
• Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet
• Global Route Header (GRH): 40 bytes. Used for routing between subnets
• Base Transport Header (BTH): 12 bytes, for IBA transport
• Extended transport headers:
– Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagrams
– Datagram Extended Transport Header (DETH): 8 bytes
– RDMA Extended Transport Header (RETH): 16 bytes
– Atomic, ACK, Atomic ACK, ...
• Immediate data extended transport header: 4 bytes, optimized for small packets.
• Invariant CRC and variant CRC:
– CRCs covering the fields that do not change in transit and those that do, respectively.
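The header sizes can be pinned down as opaque byte containers; the internal field layouts are omitted here (they are in the IBA specification), so these structs are illustrative only:

    #include <stdint.h>

    struct lrh   { uint8_t bytes[8];  };  /* Local Route Header    */
    struct grh   { uint8_t bytes[40]; };  /* Global Route Header   */
    struct bth   { uint8_t bytes[12]; };  /* Base Transport Header */
    struct rdeth { uint8_t bytes[4];  };  /* Reliable Datagram ETH */
    struct deth  { uint8_t bytes[8];  };  /* Datagram ETH          */
    struct reth  { uint8_t bytes[16]; };  /* RDMA ETH              */

    _Static_assert(sizeof(struct lrh) == 8,  "LRH is 8 bytes");
    _Static_assert(sizeof(struct grh) == 40, "GRH is 40 bytes");
    _Static_assert(sizeof(struct bth) == 12, "BTH is 12 bytes");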
• Local Route Header:
– Switching is based on the destination port address (LID) carried in the header (a lookup sketch follows below)
– Multipath switching is achieved by allocating multiple LIDs to one port
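A minimal sketch of LID-based switching: each switch holds a forwarding table, computed by the subnet manager, that maps a destination LID directly to an output port; giving one end port several LIDs lets different LIDs take different paths. The names and table size here are illustrative, not from any real switch API:

    #include <stdint.h>

    #define MAX_LID 49152            /* unicast LID space (illustrative)    */

    struct ib_switch {
        uint8_t out_port[MAX_LID];   /* DLID -> output port, filled in
                                        by the subnet manager               */
    };

    static uint8_t forward(const struct ib_switch *sw, uint16_t dlid) {
        /* Single-path deterministic lookup; 0 stands in for "drop". */
        return (dlid < MAX_LID) ? sw->out_port[dlid] : 0;
    }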
Subnet management
• Initializes the network
– Discovers the subnet topology and topology changes, computes the paths, assigns LIDs, distributes the routes, and configures devices.
– Related devices and entities:
• Devices: channel adapters (CA), host channel adapters, switches, routers
• Subnet manager (SM): discovers, configures, activates, and manages the subnet
• A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management
• The SM exchanges control packets with SMAs through the subnet management interface (SMI).
• Subnet management phases (sketched in code below):
– Topology discovery: sending directed-route SMPs to every port and processing the responses.
– Path computation: computing valid paths between each pair of end nodes.
– Path distribution: configuring the forwarding tables.
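A high-level sketch of the three phases. The topology type and the helpers (discover_topology, compute_path, program_forwarding_entries) are hypothetical stand-ins for the directed-route SMP exchanges; this illustrates the control flow only:

    #include <stddef.h>

    struct topology { int num_endports; int num_switches; };

    struct topology *discover_topology(void);                     /* phase 1 */
    void compute_path(struct topology *t, int src, int dst);      /* phase 2 */
    void program_forwarding_entries(struct topology *t, int sw);  /* phase 3 */

    void subnet_manager_sweep(void) {
        /* Phase 1: discover the topology by sending directed-route
         * SMPs to every port and processing the responses. */
        struct topology *topo = discover_topology();
        if (topo == NULL)
            return;

        /* Phase 2: compute a deterministic path between every pair
         * of end nodes, assigning LIDs along the way. */
        for (int s = 0; s < topo->num_endports; s++)
            for (int d = 0; d < topo->num_endports; d++)
                if (s != d)
                    compute_path(topo, s, d);

        /* Phase 3: distribute the routes by programming each
         * switch's LID -> output-port forwarding table. */
        for (int sw = 0; sw < topo->num_switches; sw++)
            program_forwarding_entries(topo, sw);
    }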
• Verbs
– The OS and users access the adapter through verbs
– Communication mechanism: the Queue Pair (QP)
• Users can queue up a set of instructions that the hardware executes.
• Each QP is a pair of queues: one for send, one for receive.
• Users post send requests to the send queue and receive requests to the receive queue.
• Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), MEMORY-BINDING
• One receive operation (matching SEND)
• To communicate:
– Make system calls to set everything up (open the QP, bind the QP to a port, bind completion queues, connect the local QP to the remote QP, register memory, etc.).
– Post send/receive requests as user-level instructions.
– Check completion (a libibverbs sketch follows below).
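A condensed libibverbs sketch of this send path. The setup steps above (device open, QP creation, the INIT/RTR/RTS state transitions via ibv_modify_qp, and the out-of-band exchange of QP numbers) are assumed to have happened already, and error handling is minimal; a sketch, not a complete program:

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    int post_one_send(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq) {
        /* Register a memory region so the HCA can DMA from it. */
        char *buf = malloc(4096);
        strcpy(buf, "hello");
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);
        if (!mr) return -1;

        /* Describe the buffer and build a SEND work request. */
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = 4096, .lkey = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id = 1, .sg_list = &sge, .num_sge = 1,
            .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr *bad_wr = NULL;

        /* Post to the send queue: a user-level operation, no system call. */
        if (ibv_post_send(qp, &wr, &bad_wr)) return -1;

        /* Check completion by polling the completion queue. */
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;  /* spin until the work request completes */
        return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }

The post and the completion poll are exactly the OS-bypass path described below: once setup is done, transmissions complete without entering the kernel.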
• InfiniBand has an almost perfect software/network interface:
– The network subsystem realizes most user-level functionality.
• The network supports in-order delivery and fault tolerance.
• Buffer management is pushed out to the user.
– OS bypass: user-level access to the network interface. A few machine instructions accomplish a transmission without involving the OS.