Interconnection networks: network interface and a case study
Network interface design issues
• The networking requirements from the user's perspective:
  – In-order message delivery
  – Reliable delivery
    • Error control
    • Flow control
  – Deadlock freedom
• Typical network hardware features:
  – Arbitrary delivery order (adaptive/multipath routing)
  – Finite buffering
  – Limited fault handling
• How and where should we bridge the gap?
  – Network hardware? Network systems? Or a hardware/systems/software approach?
The Internet approach
• How does the Internet realize these functions?
  – No deadlock issue
  – Reliability, flow control, and in-order delivery are done at the TCP layer
  – The network layer (IP) provides best-effort service
    • IP is implemented in software as well
• Drawbacks:
  – Too many layers of software
  – Users need to go through the OS to access the communication hardware (system calls can cause context switching), as the sketch below illustrates
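To make the system-call point concrete, here is a minimal sketch in Python: an ordinary TCP client in which every send and receive crosses into the OS kernel. The host and port values are placeholders for illustration only.

```python
import socket

# Minimal TCP round trip: each sendall()/recv() below is a system call,
# so the data path goes through the OS kernel's TCP/IP stack
# (buffer copies, protocol processing, possible context switches).
def tcp_round_trip(host="127.0.0.1", port=9999, payload=b"ping"):
    with socket.create_connection((host, port)) as s:
        s.sendall(payload)           # system call: user -> kernel copy, TCP processing
        return s.recv(len(payload))  # system call: blocks until the kernel delivers data
```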
Approach in HPC networks
• Where should these functions be realized?
  – High-performance networking: most functionality below the network layer is done by the hardware (or nearly so)
  – The hardware provides the APIs for network transactions
• If there is a mismatch between what the network provides and what users want, a software messaging layer is created to bridge the gap.
Messaging Layer
• Bridge between the hardware functionality and the user communication requirements
  – Typical network hardware features:
    • Arbitrary delivery order (adaptive/multipath routing)
    • Finite buffering
    • Limited fault handling
  – Typical user communication requirements:
    • In-order delivery
    • End-to-end flow control
    • Reliable transmission
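As an illustration of the gap the messaging layer bridges, below is a minimal sketch (not any real messaging layer's code) of a receive-side reordering buffer: it accepts packets in arbitrary arrival order, holds early arrivals in finite buffers, and releases data to the user strictly in sequence-number order.

```python
# Sketch of one job a messaging layer does: present in-order delivery on top of
# hardware that may deliver packets out of order (e.g. adaptive routing).
# Packet format and names are illustrative only.

class ReorderBuffer:
    def __init__(self):
        self.next_seq = 0   # next sequence number the user expects
        self.pending = {}   # out-of-order packets held in (finite) buffers

    def on_packet(self, seq, payload):
        """Called for each arriving packet; returns payloads deliverable in order."""
        self.pending[seq] = payload
        deliverable = []
        while self.next_seq in self.pending:
            deliverable.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return deliverable

buf = ReorderBuffer()
print(buf.on_packet(1, "B"))   # [] -- arrived early, buffered
print(buf.on_packet(0, "A"))   # ['A', 'B'] -- gap filled, both delivered in order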
Communication cost
• Communication cost = hardware cost + software cost (messaging layer cost)
  – Hardware message time: msize / bandwidth
  – Software time:
    • Buffer management
    • End-to-end flow control
    • Running protocols
• Which one dominates?
  – It depends on how much the software has to do (see the sketch below).
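A toy version of this cost model, with made-up placeholder numbers, shows why the software term tends to dominate for small messages while the msize/bandwidth term dominates for large ones.

```python
# Toy version of the slide's cost model:
#   communication cost = software (messaging layer) time + hardware message time,
# with hardware message time = msize / bandwidth.
# All numbers below are illustrative placeholders, not measurements.

def message_time_us(msize_bytes, bandwidth_bytes_per_us, software_overhead_us):
    hardware_us = msize_bytes / bandwidth_bytes_per_us
    return software_overhead_us + hardware_us

# For small messages the (fixed) software overhead dominates;
# for large messages the msize/bandwidth term does.
print(message_time_us(64,    1200, 10.0))   # small message: mostly software time
print(message_time_us(65536, 1200, 10.0))   # large message: mostly transfer time
```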
Network software/hardware interaction: a case study
• A case study of the communication performance issues on the CM-5:
  – V. Karamcheti and A. A. Chien, "Software Overhead in Messaging Layers: Where Does the Time Go?", ACM ASPLOS-VI, 1994.
What do we see in the study?
• The mismatch between the user requirements and the network functionality can introduce significant software overheads (50%-70%).
• Implications:
  – Should we focus on hardware, software, or software/hardware co-design?
  – Improving routing performance may increase software cost
    • Adaptive routing introduces out-of-order packets
  – Providing low-level network features directly to applications is problematic.
Summary from the study
• In the design of the communication system, a holistic understanding must be achieved:
  – Focusing on network hardware alone may not be sufficient; software overhead can be much larger than routing time.
• It would be ideal for the network to directly provide high-level services.
  – Newer generations of interconnect hardware try to achieve this.
Case study
• IBM Blue Gene/L system
• InfiniBand
Interconnect family share for the 06/2011 Top 500 supercomputers

Interconnect Family   Count   Share %   Rmax Sum (GF)   Rpeak Sum (GF)   Processor Sum
Myrinet                   4    0.80 %         384451          524412           55152
Quadrics                  1    0.20 %          52840           63795            9968
Gigabit Ethernet        232   46.40 %       11796979        22042181         2098562
InfiniBand              206   41.20 %       22980393        32759581         2411516
Mixed                     1    0.20 %          66567           82944           13824
NUMAlink                  2    0.40 %         107961          121241           18944
SP Switch                 1    0.20 %          75760           92781           12208
Proprietary              29    5.80 %        9841862        13901082         1886982
Fat Tree                  1    0.20 %         122400          131072            1280
Custom                   23    4.60 %       13500813        15460859         1271488
Totals                  500  100.00 %    58930025.59     85179949.00         7779924
Overview of the IBM Blue Gene/L system architecture
• Design objectives
• Hardware overview
  – System architecture
  – Node architecture
  – Interconnect architecture
Highlights
• A 64K-node, highly integrated supercomputer based on system-on-a-chip technology
  – Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
• Distributed memory, massively parallel processing (MPP) architecture
• Uses the message passing programming model (MPI) (see the sketch below)
• 360 Tflops peak performance
• Optimized for cost/performance
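A minimal example of the message passing model mentioned above, written with mpi4py as an assumption for brevity (production BG/L codes were typically C or Fortran MPI):

```python
# Minimal message-passing example in the MPI model the slides mention.
# Run with: mpiexec -n 2 python this_file.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"hello": "from rank 0"}, dest=1, tag=0)   # explicit message send
elif rank == 1:
    msg = comm.recv(source=0, tag=0)                      # matching receive
    print(msg)
```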
Design objectives
• Objective 1: a 360-Tflops supercomputer
  – For comparison, the Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops
• Objective 2: power efficiency (see the worked numbers below)
  – Performance/rack = performance/watt * watt/rack
    • Watt/rack is roughly a constant, around 20 kW
    • Performance/watt therefore determines performance/rack
  – Power efficiency:
    • 360 Tflops would require about 20 megawatts with conventional processors
    • Need a low-power processor design (2-10 times better power efficiency)
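A back-of-the-envelope sketch of the power argument, using only the numbers on this slide (all values illustrative):

```python
# performance/rack = performance/watt * watt/rack, with watt/rack ~ 20 kW.
target_tflops  = 360.0
watts_per_rack = 20_000.0

# Slide: ~360 Tflops with conventional processors would need ~20 MW,
# i.e. about 360 / 20e6 Tflops per watt.
conventional_tflops_per_watt = target_tflops / 20_000_000.0

# A processor 2-10x more power efficient raises performance per rack proportionally,
# and so shrinks the number of racks needed to reach the target.
for factor in (1, 2, 10):
    tflops_per_rack = conventional_tflops_per_watt * factor * watts_per_rack
    racks_needed = target_tflops / tflops_per_rack
    print(f"{factor:>2}x efficiency: {tflops_per_rack:.2f} Tflops/rack, ~{racks_needed:.0f} racks")
```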
Design objectives (continued)
• Objective 3: extreme scalability
  – Optimizing for cost/performance means using low-power, less powerful processors, which in turn means using a lot of processors
    • Up to 65,536 processors
  – Interconnect scalability
  – Reliability, availability, and serviceability
  – Application scalability
Blue Gene/L system components
Blue Gene/L Compute ASIC
• 2 Power PC440 cores with floating-point enhancements– 700MHz– Everything of a typical superscalar processor• Pipelined microarchitecture with dual instruction fetch,
decode, and out of order issue, out of order dispatch, out of order execution and out of order completion, etc
– 1 W each through extensive power management
Memory system on a BG/L node
• BG/L only supports the distributed memory paradigm
• No need for efficient hardware support for cache coherence within a node
  – Coherence is enforced by software if needed
• The two cores operate in one of two modes:
  – Communication coprocessor mode
    • Needs coherence, managed in system-level libraries
  – Virtual node mode
    • Memory is physically partitioned (not shared)
Blue Gene/L networks
• Five networks:
  – 100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
  – 1000 Mbps Ethernet for I/O
  – Three high-bandwidth, low-latency networks for data transmission and synchronization:
    • 3-D torus network for point-to-point communication
    • Collective network for global operations
    • Barrier network
• All network logic is integrated into the BG/L node ASIC
  – Memory-mapped interfaces accessible from user space
3-D torus network
• Supports point-to-point communication
• Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
• 64x32x32 torus: diameter 32+16+16 = 64 hops, worst-case hardware latency 6.4 us (see the calculation below)
• Cut-through routing
• Adaptive routing
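The diameter and worst-case latency figures above can be reproduced with a short calculation; the per-hop latency used below is inferred from the slide (6.4 us over 64 hops), not taken from a hardware specification.

```python
# On a torus each dimension's worst-case distance is size/2, so a 64x32x32
# torus has diameter 32 + 16 + 16 = 64 hops.

def torus_diameter(dims):
    return sum(d // 2 for d in dims)

dims = (64, 32, 32)
hops = torus_diameter(dims)        # 64 hops
per_hop_us = 6.4 / hops            # ~0.1 us per hop (inferred assumption)
print(hops, hops * per_hop_us)     # 64 hops, ~6.4 us worst-case hardware latency
```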
Collective network
• Binary tree topology, static routing
• Link bandwidth: 2.8 Gb/s
• Maximum hardware latency: 5 us
• With arithmetic and logic hardware, it can perform integer operations on the data in flight
  – Efficient support for reduce, scan, global sum, and broadcast operations
  – Floating-point operations can be done with 2 passes
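These are exactly the operations exposed by MPI collectives. The sketch below uses mpi4py (an assumption, for illustration only); on a machine like BG/L such calls are the kind the collective network is designed to accelerate, with floating-point reductions needing extra passes as noted above.

```python
# Global sum and broadcast expressed as MPI collectives.
# Run with: mpiexec -n 4 python this_file.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = rank + 1                               # each node contributes one integer
total = comm.allreduce(local, op=MPI.SUM)      # global sum (reduce over a tree)
data  = comm.bcast("config" if rank == 0 else None, root=0)   # broadcast from the root
print(rank, total, data)
```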
Barrier network
• Hardware support for global synchronization
• 1.5 us for a barrier over 64K nodes
IBM Blue Gene/L summary
• Optimized for cost/performance, which limits the range of applications
  – Uses a low-power design: lower frequency, system-on-a-chip
  – Excellent performance-per-watt metric
• Scalability support
  – Hardware support for global communication and barriers
  – Low-latency, high-bandwidth support