Current Trends in Parallel System Interconnects
Eric Bohannon and Abdullah Muhammad
05/16/06
Multiple Processor Systems
Dr. Muhammad Shaaban
Outline
Generic Interconnects
• InfiniBand
• Myrinet
• SCI (Dolphin)
• Quadrics
Custom Interconnects
• IBM BlueGene/L
• Cray XT3
Comparisons of Generic Interconnects
Introduction
Interconnect Requirements (Quiz 6 Anyone???)
• Low network latency
  Small network diameter, small average distance
• High network throughput
  As many concurrent transfers as possible
• Scalable cost and performance
Interconnect Considerations
• Types Available
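The two latency metrics named in the requirements, network diameter and average distance, can be measured on any topology with a breadth-first search. The sketch below is illustrative only: the 16-node ring and the 4x4 2D torus are example sizes chosen here, not configurations from the slides.

```python
from collections import deque
from itertools import product


def ring(n):
    """Adjacency list of a bidirectional ring with n nodes."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def torus_2d(k):
    """Adjacency list of a k x k 2D torus (mesh with wraparound links)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        adj[(x, y)] = [((x + 1) % k, y), ((x - 1) % k, y),
                       (x, (y + 1) % k), (x, (y - 1) % k)]
    return adj


def hop_counts(adj, start):
    """Breadth-first search: hop count from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter_and_average(adj):
    """Return (diameter, average hop count over ordered pairs of distinct nodes)."""
    longest, total, pairs = 0, 0, 0
    for src in adj:
        dist = hop_counts(adj, src)
        longest = max(longest, max(dist.values()))
        total += sum(dist.values())
        pairs += len(dist) - 1
    return longest, total / pairs


if __name__ == "__main__":
    print("16-node ring:", diameter_and_average(ring(16)))
    print("4x4 2D torus:", diameter_and_average(torus_2d(4)))
```

For the same 16 nodes, the torus halves both the diameter and the average distance of the ring, which is why the higher-dimensional topologies appear later in the deck.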
InfiniBand
History
• Result of merging two designs
  Future I/O by Compaq, IBM, and Hewlett-Packard
  Next Generation I/O by Intel, Microsoft, Sun
• Originally envisioned as a comprehensive SAN that would connect CPUs for high speed communication
• Intended to be a replacement for PCI
  Not a replacement for Ethernet.
  Not a wide area network. Used within a computer room facility (< 100 meters diameter)
  Not a replacement for Fibre Channel
InfiniBand
• Open Standard, High Performance, Scalable Communication and I/O Architecture
  Each connection between nodes, switches, and routers is a point-to-point, serial connection
  Allows for multiple connections from a single host, increasing overall availability
  Supports up to 48,000 nodes per subnet
InfiniBand
Advantages
• Reduced CPU and memory overheads by utilizing specialized HCA (Host Channel Adapter) hardware
• Low end-to-end system latency (from 5 to 10 microseconds, depending on the application)
• Consolidated I/O with network, management, and storage all on one interface
Myrinet
History
• ANSI standard designed in 1998
• Intended to replace Ethernet by minimizing protocol overhead
  Increase throughput
  Decrease latency
  Less interference
• Programs should be aware of it so that they can bypass calls into the operating system
Myrinet
Cost-effective, high-performance, packet-communication and switching technology
• 'D', 'E', and 'F' Network Interface Cards
• The network is based on a Clos design that uses small switch elements to build larger switches
• Data packets are source routed: each host must know the route to all of the other hosts through the switch fabric.
• Switches are multiple-port components that route a packet entering on an input channel of a port to the output channel of the port selected by the packet.
• NICs do most of the work, switches are simple
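Source routing as described above can be sketched in a few lines: the sending host prepends the whole route, and each switch only strips the leading entry and forwards on the port it names. This is a simplified model, not the Myrinet wire format; the dictionary packet layout, the function names, and the two-switch fabric in the usage example are all hypothetical.

```python
def build_packet(route, payload):
    """Sender side: attach the full route, one output port per switch hop."""
    return {"route": list(route), "payload": payload}


def switch_forward(packet):
    """Switch side: strip the leading route entry, return (output_port, packet)."""
    port = packet["route"][0]
    rest = {"route": packet["route"][1:], "payload": packet["payload"]}
    return port, rest


def deliver(packet, fabric, first_switch):
    """Walk a packet through the fabric until its route is used up.

    fabric maps (switch, output_port) to whatever is attached to that port,
    either another switch or a destination host.
    """
    here = first_switch
    while packet["route"]:
        port, packet = switch_forward(packet)
        here = fabric[(here, port)]
    return here, packet["payload"]


if __name__ == "__main__":
    # Hypothetical fabric: hosts A, B, C attached to switches S0 and S1.
    fabric = {("S0", 0): "A", ("S0", 1): "S1",
              ("S1", 2): "B", ("S1", 3): "C"}
    # Host A's routing table: the ports to take, in order, for each peer.
    routes_from_a = {"B": [1, 2], "C": [1, 3]}
    packet = build_packet(routes_from_a["B"], "hello")
    print(deliver(packet, fabric, "S0"))
```

The switch function holds no routing table at all, which is the sense in which the NICs do most of the work.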
SCI
Scalable Coherent Interface history
• IEEE Standard developed from Futurebus
• Intended to be a single standard that could be used for buses in all computers
• Installed as an adapter in a PCI slot
• Sun Microsystems standardized SCI for all of their high performance systems
• Designed to connect a large number of nodes
SCI
Switch-less Network
• Nodes are connected to one another using either a Ring, 2D Torus, or 3D Torus (unidirectional point-to-point links in a ring/ringlet topology)
• SCI chips on the NICs handle all of the routing
• SCI uses 64-bit addressing; the most significant 16 bits are used for addressing up to 64K nodes.
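The address split stated above (top 16 of 64 bits select the node) can be written out as bit arithmetic. This is a minimal sketch; the function names are made up for illustration, and treating the remaining 48 bits as an offset within the node follows from the 64-bit total rather than from the slides.

```python
NODE_BITS = 16                   # most significant bits select the node
OFFSET_BITS = 64 - NODE_BITS     # remaining 48 bits address memory within it


def split_address(addr):
    """Split a 64-bit SCI-style address into (node_id, local_offset)."""
    if not 0 <= addr < 1 << 64:
        raise ValueError("address must fit in 64 bits")
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)


def join_address(node_id, offset):
    """Inverse of split_address."""
    if not 0 <= node_id < 1 << NODE_BITS:
        raise ValueError("node id must fit in 16 bits")
    if not 0 <= offset < 1 << OFFSET_BITS:
        raise ValueError("offset must fit in 48 bits")
    return (node_id << OFFSET_BITS) | offset


if __name__ == "__main__":
    print(1 << NODE_BITS)                         # number of addressable nodes
    print(split_address(0x0005_0000_0000_1000))   # node id and local offset
```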
SCI
Fully distributed and scalable
Each node maintains two queues, which serve as buffers until transmission bandwidth becomes available for outbound packets or until inbound packets can be processed by the node's application logic.
Transactions are split into a request and a response sub-action
SCI
Advantages
• It is not only a System Area Network, it also allows remote memory accesses.
• Suitable for both message passing and shared memory programming on clusters.
Quadrics
Supercomputer company formed in 1996
In June 2004 the 2nd and the 3rd fastest supercomputers used QsNet, the Quadrics interconnect
Multi-teraflop systems can be constructed from commodity servers
High performance PCI-X interfaces
Quadrics
A 'fat tree' topology is used
Basic component of the QsNetII network is an 8 port custom switch
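To get a feel for how a fat tree grows from 8-port switches, the sketch below uses the standard k-ary n-tree counts: with k ports facing down and k facing up, n levels connect k^n nodes using n * k^(n-1) switches. Splitting the 8 ports as 4 down and 4 up is an assumption made here; the slides state only the port count.

```python
def fat_tree_size(levels, ports=8):
    """Node and switch counts for a k-ary n-tree built from `ports`-port switches.

    Assumes half of each switch's ports face down toward the nodes and half
    face up toward the next level (k = ports // 2).
    """
    if levels < 1:
        raise ValueError("a fat tree needs at least one level")
    k = ports // 2
    nodes = k ** levels
    switches = levels * k ** (levels - 1)
    return nodes, switches


if __name__ == "__main__":
    for levels in range(1, 6):
        nodes, switches = fat_tree_size(levels)
        print(f"{levels} level(s): {nodes} nodes, {switches} switches")
```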
Quadrics
QsNet II Components (figure): Elan 4 network interface card, Elite 4 switch component, QsNet II switch
IBM BlueGene/L
Nodes are interconnected through five networks
• A 3D torus network for point-to-point messaging between compute nodes
• A global combining/broadcast tree for collective operations such as MPI_Allreduce over the entire application
• A global barrier and interrupt network
• A Gigabit Ethernet to JTAG network for machine control
• Another Gigabit Ethernet network for connection to other systems, such as hosts and file systems.
IBM BlueGene/L Interconnects
3D Torus
• Used mainly for application messaging
• Low latency, high bandwidth point-to-point message passing.
• 175 MB/sec in each direction
• Messages are passed through intermediate nodes using cut-through traffic with a transit delay of 100 ns per node.
• Adaptive routing in each node
• Network diameter is 64 nodes, resulting in a maximum transit delay of 6.4 µs (64 hops × 100 ns)
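The transit-delay arithmetic can be checked directly: on a torus, the hop count along each dimension is the shorter of the two ways around, and each hop costs 100 ns. The 64 x 32 x 32 dimensions below are an assumption (they are not given on the slide) chosen because they yield the stated diameter of 64; the function names are illustrative.

```python
PER_HOP_NS = 100          # transit delay per node, from the slide
DIMS = (64, 32, 32)       # assumed torus dimensions, not stated on the slide


def torus_hops(src, dst, dims):
    """Minimal hop count between two coordinates on a torus with wraparound."""
    hops = 0
    for s, d, size in zip(src, dst, dims):
        delta = abs(s - d)
        hops += min(delta, size - delta)   # go whichever way round is shorter
    return hops


def transit_delay_us(hops):
    """Cut-through transit delay in microseconds for a given hop count."""
    return hops * PER_HOP_NS / 1000


if __name__ == "__main__":
    # The farthest node is half-way round in every dimension.
    worst = torus_hops((0, 0, 0), (32, 16, 16), DIMS)
    print(worst, "hops,", transit_delay_us(worst), "microseconds")
```

With these dimensions the worst case is 32 + 16 + 16 = 64 hops, and 64 hops at 100 ns each is 6.4 µs.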
IBM BlueGene/L Interconnects
Global Collective Network
• Used for globally broadcasting data
• Spans the whole network
• Every link has a bandwidth of 2.8 Gb/s
• One node can send data to all the other nodes, or just a subset of all the nodes, in less than 5 µs
• Arithmetic and logical operators (min, max, sum, bitwise logical OR, AND, and XOR) are built into the network hardware
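A combining tree reduces one value per node by applying the operator at every internal tree node on the way to the root, so the number of combining steps grows with the tree depth rather than with the node count. The software model below only illustrates that idea; pairing neighbours into a binary tree is an assumption made here, and the real network does this in hardware.

```python
import operator

# The operators the slide lists as built into the network hardware.
OPS = {
    "min": min,
    "max": max,
    "sum": operator.add,
    "or": operator.or_,
    "and": operator.and_,
    "xor": operator.xor,
}


def tree_reduce(values, op_name):
    """Combine one integer per node pairwise, level by level.

    Returns (result, levels), where levels is the number of combining
    steps between the leaves and the root.
    """
    if not values:
        raise ValueError("need at least one value")
    op = OPS[op_name]
    level = list(values)
    levels = 0
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # odd node out passes through unchanged
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


if __name__ == "__main__":
    contributions = [3, 1, 4, 1, 5, 9, 2, 6]   # one value per compute node
    for name in OPS:
        print(name, tree_reduce(contributions, name))
```

Eight contributions are combined in three levels; this is the shape of computation that a collective such as MPI_Allreduce maps onto.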
IBM BlueGene/L Interconnects
Control Systems Network
• Used to initialize, monitor and control the control devices and sensors
• More than 250,000 endpoints (e.g. ASICs, temperature sensors, power supplies, clock trees, fans, status LEDs)
• Controlled by a service node
• An FPGA converts the 100 Mb packets into control packets
Gigabit Ethernet Network
• Used in the I/O nodes
• Connects to the external parallel file system
• Maximum I/O to compute node ratio: 1:8, resulting in a maximum of 1024 I/O nodes with a total I/O bandwidth of > 1 Tbps
Cray XT3
Cray's third generation of Massively Parallel Processors
3D Torus Topology
Designed upon a single processor node
• One AMD Opteron
  Has own memory
• Dedicated communication resource
  Cray SeaStar routing and communication chip
  Eliminates the cost and complexity of external switches
Two Main Types of Processing Elements
• Compute PEs
• Service PEs
Cray XT3 Architecture
Cray SeaStar Chip
The Cray SeaStar chip combines communications processing and high speed routing on a single device.
• HyperTransport link
• Direct Memory Access (DMA) engine
• A communications and management processor
• A high-speed interconnect router
• A service port
Cray SeaStar Chip
HyperTransport Link
• Connects the Opteron processor with the Cray SeaStar chip
• Bandwidth of 6.4 GB/s
Cray SeaStar Chip
DMA Engine
• Has an associated PowerPC 440 processor
• Used to off-load message passing operations and demultiplexing tasks from the Opteron processor
• Establishes a direct path between the application and the communication hardware
• Bypasses any traps or interrupts that are associated with traversing a protected kernel
Cray SeaStar Chip
Interconnect router
• Provides six high-speed network links which connect to six neighbors in the 3D torus
• Peak bidirectional bandwidth of each link is 7.6 GB/s
• Sustained bandwidth greater than 4 GB/s
• Aggregate bandwidth of 45.6 GB/s
• The router also includes a reliable link protocol with error correction and retransmission.
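The six-neighbor layout and the aggregate figure follow from simple arithmetic: a 3D torus node has one neighbor in each direction along each of three axes, and six links at 7.6 GB/s peak sum to 45.6 GB/s. The sketch below assumes that reading of the aggregate figure; the function names and the 4 x 4 x 4 example torus are illustrative only.

```python
LINKS_PER_ROUTER = 6
PEAK_LINK_GB_PER_S = 7.6   # peak bidirectional bandwidth per link, from the slide


def torus_neighbors(coord, dims):
    """The six neighbors of a node in a 3D torus (plus/minus one on each axis)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]   # wraparound at the edges
            neighbors.append(tuple(n))
    return neighbors


def aggregate_bandwidth():
    """Sum of the peak bandwidth of all router links, in GB/s."""
    return LINKS_PER_ROUTER * PEAK_LINK_GB_PER_S


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    print(round(aggregate_bandwidth(), 1), "GB/s")
```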
Cray SeaStar Chip
Service Port
• Bridges between the management network and the SeaStar local bus
• Allows access to all registers and memory
• Facilitates booting, maintenance, and system monitoring
Cray XT3 Implementation
Bigben (Pittsburgh Supercomputing Center)
• A Cray XT3 MPP system
• 2068 compute processors
• Each processor has its own 1 Gbyte of memory.
Interconnects/Systems
Interconnects/Performance
Interconnect Latency/Bandwidth
Interconnect                               Latency (microseconds)   Bandwidth (MBps)   N/2 (Bytes) *
GigE                                       ~29-120                  ~125               ~8,000
10 GigE: Chelsio (Copper)                  9.6                      ~862               ~100,000+
Infiniband: Mellanox Infinihost (PCI-X)    4.1                      760                512
Myrinet D (mx)                             3.5                      ~493               ~1,000
Myrinet F (mx)                             2.6                      ~493               ~1,000
Myrinet E (mx)                             2.7                      ~493               ~1,000
Myri-10G                                   2.0                      1,200              ~1,000
Quadrics                                   1.29                     ~875-910           ~576
Dolphin                                    4.2                      457.5              ~800
* The N/2 packet size is the size of the packets in bytes that reach half the bandwidth of the interconnect. It is important because it tells if small packets get good bandwidth performance.
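The N/2 definition is easiest to see under a simple linear cost model, in which sending n bytes takes the startup latency plus n divided by the peak bandwidth. That model is an assumption made here for illustration: the N/2 values in the table are measured results and need not follow it. The latency and bandwidth in the usage lines are round example numbers, not table entries.

```python
def transfer_time_us(size_bytes, latency_us, peak_mbps):
    """Time to send one message under the model t = latency + size / bandwidth.

    1 MB/s equals 1 byte per microsecond, so size / peak_mbps is in microseconds.
    """
    return latency_us + size_bytes / peak_mbps


def effective_bandwidth_mbps(size_bytes, latency_us, peak_mbps):
    """Bandwidth actually achieved for messages of the given size, in MB/s."""
    return size_bytes / transfer_time_us(size_bytes, latency_us, peak_mbps)


def n_half_bytes(latency_us, peak_mbps):
    """Message size at which the model reaches half of the peak bandwidth.

    Solving n / (latency + n / peak) = peak / 2 gives n = latency * peak.
    """
    return latency_us * peak_mbps


if __name__ == "__main__":
    latency_us, peak_mbps = 2.0, 1000.0          # example values only
    n_half = n_half_bytes(latency_us, peak_mbps)
    print(n_half, effective_bandwidth_mbps(n_half, latency_us, peak_mbps))
    # Messages far smaller than N/2 see only a fraction of the peak.
    print(effective_bandwidth_mbps(64, latency_us, peak_mbps))
```

Under this model, a lower latency shrinks N/2, so small packets reach a useful share of the bandwidth sooner.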
Interconnect Costs
Interconnect                  8 Node Cost    24 Node Cost   128 Node Cost
GigE                          $258.00        $944.00        $27,328.00
10 GigE: Chelsio (Copper)     $15,960.00     $62,280.00     $447,360.00
Infiniband: Voltaire          $11,877.00     $23,084.00     $182,083.00
Myrinet D (gm/mx)             $7,200.00      $21,600.00     $115,200.00
Myrinet F (gm/mx)             $8,000.00      $24,000.00     $128,000.00
Myrinet E (gm/mx)             $12,000.00     $36,000.00     $192,000.00
Myri-10G                      $9,600.00      $28,800.00     $153,600.00
Quadrics                      $13,073.00     $43,698.00     $205,538.00
Dolphin                       $7,800.00      NA             $140,160.00
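Dividing each total by the node count shows how cost scales with cluster size, which is the "scalable cost" requirement from the introduction. The sketch below copies three rows from the cost table as examples; the dictionary layout and function name are illustrative, and None stands for the table's NA entry.

```python
# Total interconnect cost in dollars by cluster size, copied from the cost table.
COSTS = {
    "GigE": {8: 258.00, 24: 944.00, 128: 27328.00},
    "Myrinet D (gm/mx)": {8: 7200.00, 24: 21600.00, 128: 115200.00},
    "Dolphin": {8: 7800.00, 24: None, 128: 140160.00},
}


def cost_per_node(name):
    """Per-node cost at each cluster size, skipping sizes with no quoted price."""
    return {nodes: total / nodes
            for nodes, total in COSTS[name].items()
            if total is not None}


if __name__ == "__main__":
    for name in COSTS:
        per_node = cost_per_node(name)
        print(name, {n: round(c, 2) for n, c in per_node.items()})
```

In these rows Myrinet D stays at a flat per-node price across all three sizes, while the GigE per-node price rises sharply at 128 nodes.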
Specifications
Components
  Gigabit Ethernet: Network Interface Card + Switch
  Myrinet: Host Adapter + Switch Component
  InfiniBand: Host Channel Adapter (HCA) + switched, channel-based interconnection fabric of switches
Approx. CPU Overhead
  Gigabit Ethernet: 80%
  Myrinet: 6%
  InfiniBand: 3%
Switch Latency
  Gigabit Ethernet: 10,000 ns
  Myrinet: 200 ns
  InfiniBand: 160 ns
Line Speed / Bandwidth
  Gigabit Ethernet: 10 Gbps (up to 40 Gbps planned)
  Myrinet: 10 Gbps
  InfiniBand: 1X (4 pins) - 2.5 Gbps; 4X (16 pins) - 10 Gbps; 12X (48 pins) - 30 Gbps
Cluster Applications
  Gigabit Ethernet: Coarse-grained (loosely coupled) parallel applications; some medium-grained parallel applications
  Myrinet: Fine (tightly coupled) to medium-grained parallel applications
  InfiniBand: Fine (tightly coupled) to medium-grained parallel applications
MPI Broadcast Comparison (InfiniBand, Myrinet)
Questions?