Current Trends in Parallel System Interconnects


Eric Bohannon and Abdullah Muhammad
05/16/06

Multiple Processor Systems
Dr. Muhammad Shaaban

Outline

Generic Interconnects
• InfiniBand
• Myrinet
• SCI (Dolphin)
• Quadrics

Custom Interconnects
• IBM BlueGene/L
• Cray XT3

Comparisons of Generic Interconnects

Introduction

Interconnect Requirements (Quiz 6 Anyone???)
• Low network latency
  - Small network diameter, small average distance
• High network throughput
  - As many concurrent transfers as possible
• Scalable cost and performance
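As a rough illustration of the diameter and average-distance requirement, the sketch below computes both metrics for ring and torus topologies. The function names are hypothetical, and it assumes bidirectional links with shortest-path routing.

```python
from itertools import product


def ring_distance(a, b, k):
    """Hop count between positions a and b on a bidirectional ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)


def torus_stats(dims):
    """Return (diameter, average distance) for a torus with the given dimensions.

    A one-element tuple such as (8,) describes a plain ring.
    """
    nodes = list(product(*(range(k) for k in dims)))
    origin = nodes[0]  # a torus looks the same from every node, so one source suffices
    dists = [
        sum(ring_distance(o, n, k) for o, n, k in zip(origin, node, dims))
        for node in nodes
        if node != origin
    ]
    return max(dists), sum(dists) / len(dists)


if __name__ == "__main__":
    for dims in [(16,), (4, 4), (8, 8, 8)]:
        diameter, average = torus_stats(dims)
        print(dims, "diameter:", diameter, "average distance:", round(average, 2))
```

Comparing a 16-node ring with a 4 x 4 torus shows how adding dimensions shrinks both metrics for the same node count.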

Interconnect Considerations
• Types Available

InfiniBand
History
• Result of merging two designs
  - Future I/O by Compaq, IBM, and Hewlett-Packard
  - Next Generation I/O by Intel, Microsoft, and Sun
• Originally envisioned as a comprehensive SAN that would connect CPUs for high-speed communication
• Intended to be a replacement for PCI
  - Not a replacement for Ethernet
  - Not a wide area network; used within a computer room facility (< 100 meters diameter)
  - Not a replacement for Fibre Channel

InfiniBand
• Open standard, high-performance, scalable communication and I/O architecture
  - Each connection between nodes, switches, and routers is a point-to-point serial connection
  - Allows for multiple connections from a single host, increasing overall availability
  - Supports up to 48,000 nodes per subnet

InfiniBand
Advantages
• Reduced CPU and memory overhead through specialized Host Channel Adapter (HCA) hardware
• Low end-to-end system latency (from 5 to 10 microseconds, depending on the application)
• Consolidated I/O, with network, management, and storage all on one interface

InfiniBand [figure]

Myrinet
History
• ANSI standard designed in 1998
• Intended to replace Ethernet by minimizing protocol overhead
  - Increase throughput
  - Decrease latency
  - Less interference
• Programs should be aware of it so that they can bypass the call into the operating system

Myrinet
Cost-effective, high-performance, packet-communication and switching technology
• 'D', 'E', and 'F' Network Interface Cards
• The network is based on a Clos design that uses small switch elements to build larger switches
• Data packets are source routed: each host must know the route to all of the other hosts through the switch fabric
• Switches are multiple-port components that route a packet entering on an input channel of a port to the output channel of the port selected by the packet
• NICs do most of the work; switches are simple
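The source-routing idea above can be sketched as follows: the sending host prepends the list of output ports, and each switch only consumes the first entry and forwards the packet. The topology dictionary, node names, and function name are hypothetical, and the sketch does not reproduce the actual Myrinet header format.

```python
def route_packet(route, payload, topology, first_switch):
    """Walk a source-routed packet through the fabric.

    route:        list of output ports chosen by the sending host
    topology:     maps (switch, output_port) -> next switch or destination host
    first_switch: the switch the sending host is attached to
    """
    header = list(route)
    here = first_switch
    while header:
        port = header.pop(0)            # the switch strips the leading route entry
        here = topology[(here, port)]   # and forwards on the port the packet selected
    return here, payload


if __name__ == "__main__":
    # Hypothetical two-switch fabric: host A -> s0 -> s1 -> host B.
    fabric = {("s0", 1): "s1", ("s1", 3): "hostB"}
    print(route_packet([1, 3], b"hello", fabric, "s0"))
```

Because the switches keep no routing tables, all path knowledge lives in the hosts, which is why the NICs do most of the work.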

Myrinet [figure]

SCI
Scalable Coherent Interface history
• IEEE standard developed from Futurebus
• Intended to be a single standard that could be used for buses in all computers
• Installed as an adapter in a PCI slot
• Sun Microsystems standardized on SCI for all of their high-performance systems
• Designed to connect a large number of nodes

SCI
Switch-less Network
• Nodes are connected to one another using a ring, 2D torus, or 3D torus (unidirectional point-to-point links in a ring/ringlet topology)
• SCI chips on the NICs handle all of the routing
• SCI uses 64-bit addressing; the most significant 16 bits are used for addressing up to 64K nodes
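A minimal sketch of the address split described above. It assumes the remaining 48 bits are an offset within the target node's memory; the helper names are hypothetical.

```python
NODE_BITS = 16      # most significant bits select the node
OFFSET_BITS = 48    # assumption: the rest of the 64-bit address is a per-node offset


def make_sci_address(node_id, offset):
    """Pack a node ID and a local offset into one 64-bit address."""
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id does not fit in 16 bits")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in 48 bits")
    return (node_id << OFFSET_BITS) | offset


def split_sci_address(address):
    """Return (node_id, offset) for a 64-bit address."""
    return address >> OFFSET_BITS, address & ((1 << OFFSET_BITS) - 1)


if __name__ == "__main__":
    addr = make_sci_address(0x1234, 0xABCDEF)
    print(hex(addr), split_sci_address(addr))
    print("addressable nodes:", 1 << NODE_BITS)  # 65536, i.e. 64K
```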

SCI
Fully distributed and scalable
Each node maintains two queues, which serve as buffers until transmission bandwidth becomes available for outbound packets or until inbound packets can be processed by the node's application logic
Transactions are split into a request and a response sub-action

SCI
Advantages
• It is not only a System Area Network; it also allows remote memory accesses
• Suitable for both message passing and shared-memory programming on clusters

Quadrics
Supercomputer company formed in 1996
In June 2004, the 2nd and 3rd fastest supercomputers used QsNet, the Quadrics interconnect
Multi-teraflop systems can be constructed from commodity servers
High-performance PCI-X interfaces

Quadrics
A 'fat tree' topology is used
The basic component of the QsNetII network is an 8-port custom switch
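To give a feel for how a fat tree built from 8-port switches scales, the sketch below counts nodes and switches per tree depth. It assumes each switch uses 4 ports downward and 4 upward (a quaternary fat tree); that split and the function names are assumptions, not taken from the slide.

```python
def fat_tree_nodes(levels, down_ports=4):
    """Number of end nodes a fat tree with the given number of switch levels can attach."""
    return down_ports ** levels


def fat_tree_switches(levels, down_ports=4):
    """Total switches in a k-ary n-tree: every level holds down_ports**(levels-1) switches."""
    return levels * down_ports ** (levels - 1)


if __name__ == "__main__":
    for levels in range(1, 6):
        print(levels, "levels:", fat_tree_nodes(levels), "nodes,",
              fat_tree_switches(levels), "switches")
```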

Quadrics [figure: QsNet II components: Elan 4 network interface card, Elite 4 switch component, QsNet II switch]

IBM BlueGene/L
Nodes are interconnected through five networks
• A 3D torus network for point-to-point messaging between compute nodes
• A global combining/broadcast tree for collective operations such as MPI_Allreduce over the entire application
• A global barrier and interrupt network
• A Gigabit Ethernet to JTAG network for machine control
• Another Gigabit Ethernet network for connection to other systems, such as hosts and file systems

IBM BlueGene/L Interconnects
3D Torus
• Used mainly for application messaging
• Low-latency, high-bandwidth point-to-point message passing
• 175 MB/sec in each direction
• Messages are passed through intermediate nodes using cut-through routing, with a transit delay of 100 ns per node
• Adaptive routing in each node
• Network diameter is 64 nodes, resulting in a maximum transit delay of 6.4 µs (64 hops × 100 ns per hop)
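The worst-case transit delay follows from the diameter and the per-node delay. The sketch below works through that arithmetic; the 64 x 32 x 32 torus dimensions are an assumption (they give the 64-hop diameter quoted above), and the function names are hypothetical.

```python
def torus_diameter(dims):
    """Longest shortest path in a torus: half way around the ring in every dimension."""
    return sum(k // 2 for k in dims)


def worst_case_transit_us(dims, per_hop_ns=100):
    """Worst-case cut-through transit delay in microseconds."""
    return torus_diameter(dims) * per_hop_ns / 1000


if __name__ == "__main__":
    dims = (64, 32, 32)  # assumed torus shape, not stated on the slide
    print("diameter:", torus_diameter(dims), "hops")
    print("worst-case transit delay:", worst_case_transit_us(dims), "µs")
```

At 100 ns per hop, a 64-hop path costs about 6.4 µs of transit delay.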

IBM BlueGene/L Interconnects
Global Collective Network
• Used for globally broadcasting data
• Spans the whole network
• Every link has a bandwidth of 2.8 Gb/s
• One node can send data to all the other nodes, or to just a subset of the nodes, in less than 5 µs
• Arithmetic and logical operators (min, max, sum, bitwise OR, AND, and XOR) are built into the network hardware
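As a software analogy for what the collective network does in hardware, the sketch below combines one value per node pairwise up a binary tree and then hands the result back to every node, which is the behavior of an all-reduce. The binary tree shape and all names are assumptions for illustration only.

```python
import operator

# The operators the slide lists as built into the network hardware.
OPS = {
    "min": min,
    "max": max,
    "sum": operator.add,
    "or": operator.or_,
    "and": operator.and_,
    "xor": operator.xor,
}


def tree_allreduce(values, op):
    """Reduce one integer per node up a tree, then broadcast the result to all nodes."""
    if not values:
        raise ValueError("need at least one node value")
    combine = OPS[op]
    level = list(values)
    while len(level) > 1:
        # Combine neighbours pairwise; an unpaired value moves up unchanged.
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return [level[0]] * len(values)  # broadcast step: every node receives the result


if __name__ == "__main__":
    print(tree_allreduce([3, 1, 4, 1, 5], "sum"))
    print(tree_allreduce([3, 1, 4, 1, 5], "max"))
```

The number of combining steps grows with the logarithm of the node count, which is why a tree suits global operations better than point-to-point messages over the torus.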

IBM BlueGene/L Interconnects
Control System Network
• Used to initialize, monitor, and control the control devices and sensors
• More than 250,000 endpoints (e.g., ASICs, temperature sensors, power supplies, clock trees, fans, status LEDs)
• Controlled by a service node
• An FPGA converts the 100 Mb packets into control packets
Gigabit Ethernet Network
• Used in the I/O nodes
• Connects to an external parallel file system
• Maximum I/O to compute node ratio: 1:8, resulting in a maximum of 1024 I/O nodes with total I/O bandwidth of > 1 Tbps

Cray XT3
Cray's third generation of Massively Parallel Processors
3D Torus Topology
Designed around a single-processor node
• One AMD Opteron
  - Has its own memory
• Dedicated communication resource
  - Cray SeaStar routing and communication chip
  - Eliminates the cost and complexity of external switches
Two Main Types of Processing Elements
• Compute PEs
• Service PEs

Cray XT3 Architecture [figure]

Cray SeaStar Chip
The Cray SeaStar chip combines communications processing and high-speed routing on a single device.
• HyperTransport link
• Direct Memory Access (DMA) engine
• A communications and management processor
• A high-speed interconnect router
• A service port

HyperTransport Link
• Connects the Opteron processor with the Cray SeaStar chip
• Bandwidth of 6.4 GB/s

DMA Engine
• Has an associated PowerPC 440 processor
• Used to off-load message passing operations and demultiplexing tasks from the Opteron processor
• Establishes a direct path between the application and the communication hardware
• Bypasses any traps or interrupts associated with traversing a protected kernel

Interconnect Router
• Provides six high-speed network links, which connect to six neighbors in the 3D torus
• Peak bidirectional bandwidth of each link is 7.6 GB/s
• Sustained bandwidth greater than 4 GB/s
• Aggregate bandwidth of 45.6 GB/s
• The router also includes a reliable link protocol with error correction and retransmission
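The six links and the aggregate figure can be checked with a short sketch: each node in a 3D torus has one neighbor in each direction along each axis, and six links at 7.6 GB/s each give the 45.6 GB/s aggregate. The function names and the example torus shape are hypothetical.

```python
LINKS_PER_ROUTER = 6
PEAK_LINK_GB_S = 7.6  # peak bidirectional bandwidth per link


def torus_neighbors(coord, dims):
    """Return the six neighbors of a node in a 3D torus, with wraparound at the edges."""
    neighbors = []
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % size
            neighbors.append(tuple(n))
    return neighbors


def aggregate_bandwidth_gb_s():
    """Peak bandwidth summed over all router links."""
    return LINKS_PER_ROUTER * PEAK_LINK_GB_S


if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0), (4, 4, 4)))
    print(round(aggregate_bandwidth_gb_s(), 1), "GB/s")
```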

Service Port
• Bridges between the management network and the SeaStar local bus
• Allows access to all registers and memory
• Facilitates booting, maintenance, and system monitoring

Cray XT3 Implementation
Bigben (Pittsburgh Supercomputing Center)
• A Cray XT3 MPP system
• 2068 compute processors
• Each processor has its own 1 GB of memory

Interconnects/Systems [figure]

Interconnects/Performance [figure]

Interconnect Latency/Bandwidth

Interconnect                               Latency (microseconds)   Bandwidth (MBps)   N/2 (Bytes) *
GigE                                       ~29-120                  ~125               ~8,000
10 GigE: Chelsio (Copper)                  9.6                      ~862               ~100,000+
Infiniband: Mellanox Infinihost (PCI-X)    4.1                      760                512
Myrinet D (mx)                             3.5                      ~493               ~1,000
Myrinet F (mx)                             2.6                      ~493               ~1,000
Myrinet E (mx)                             2.7                      ~493               ~1,000
Myri-10G                                   2.0                      1,200              ~1,000
Quadrics                                   1.29                     ~875-910           ~576
Dolphin                                    4.2                      457.5              ~800

* The N/2 packet size is the packet size, in bytes, at which the interconnect reaches half of its bandwidth. It is important because it tells whether small packets get good bandwidth performance.
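One way to obtain an N/2 figure from a measured bandwidth curve is to find, by linear interpolation, the packet size at which the bandwidth first reaches half the peak. The sketch below does that; the function name is hypothetical and the sample measurements are made up for illustration, not taken from the table.

```python
def n_half_from_samples(samples, peak_bandwidth):
    """Interpolate the packet size at which bandwidth first reaches half the peak.

    samples: list of (packet_bytes, measured_bandwidth), sorted by packet size.
    Returns None if the curve never reaches half the peak.
    """
    target = peak_bandwidth / 2
    prev_size, prev_bw = samples[0]
    if prev_bw >= target:
        return prev_size
    for size, bw in samples[1:]:
        if bw >= target:
            # Linear interpolation between the two bracketing measurements.
            frac = (target - prev_bw) / (bw - prev_bw)
            return prev_size + frac * (size - prev_size)
        prev_size, prev_bw = size, bw
    return None


if __name__ == "__main__":
    # Hypothetical measurements in (bytes, MBps); not data from the slides.
    curve = [(64, 100), (256, 300), (1024, 600), (4096, 760)]
    print(n_half_from_samples(curve, peak_bandwidth=760))
```

A small N/2 means the interconnect gets close to its peak bandwidth even with short messages.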

Interconnect Costs

Interconnect                     8 Node Cost    24 Node Cost   128 Node Cost
GigE [1]                         $258.00        $944.00        $27,328.00
10 GigE: Chelsio (Copper) [5]    $15,960.00     $62,280.00     $447,360.00
Infiniband: Voltaire [6]         $11,877.00     $23,084.00     $182,083.00
Myrinet D (gm/mx) [8]            $7,200.00      $21,600.00     $115,200.00
Myrinet F (gm/mx) [9]            $8,000.00      $24,000.00     $128,000.00
Myrinet E (gm/mx) [10]           $12,000.00     $36,000.00     $192,000.00
Myri-10G [11]                    $9,600.00      $28,800.00     $153,600.00
Quadrics [12]                    $13,073.00     $43,698.00     $205,538.00
Dolphin [13]                     $7,800.00      NA             $140,160.00
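Dividing each total by the node count makes the scaling behavior easier to compare. The sketch below does this for the figures in the cost table; the dictionary layout and function name are hypothetical, and `None` stands for the "NA" entry.

```python
# Total interconnect cost in dollars, keyed by cluster size (from the cost table).
COSTS = {
    "GigE": {8: 258.00, 24: 944.00, 128: 27328.00},
    "10 GigE: Chelsio (Copper)": {8: 15960.00, 24: 62280.00, 128: 447360.00},
    "Infiniband: Voltaire": {8: 11877.00, 24: 23084.00, 128: 182083.00},
    "Myrinet D (gm/mx)": {8: 7200.00, 24: 21600.00, 128: 115200.00},
    "Myrinet F (gm/mx)": {8: 8000.00, 24: 24000.00, 128: 128000.00},
    "Myrinet E (gm/mx)": {8: 12000.00, 24: 36000.00, 128: 192000.00},
    "Myri-10G": {8: 9600.00, 24: 28800.00, 128: 153600.00},
    "Quadrics": {8: 13073.00, 24: 43698.00, 128: 205538.00},
    "Dolphin": {8: 7800.00, 24: None, 128: 140160.00},
}


def cost_per_node(name):
    """Return {node_count: dollars per node}, skipping sizes with no quoted price."""
    return {
        nodes: total / nodes
        for nodes, total in COSTS[name].items()
        if total is not None
    }


if __name__ == "__main__":
    for name in COSTS:
        per_node = cost_per_node(name)
        print(name, {n: round(c, 2) for n, c in per_node.items()})
```

For example, the Myrinet D figures work out to the same cost per node at every cluster size, while GigE becomes more expensive per node as the cluster grows.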

Specifications

Components
• Gigabit Ethernet: Network Interface Card + Switch
• Myrinet: Host Adapter + Switch Component
• InfiniBand: Host Channel Adapter (HCA) + switched, channel-based interconnection fabric of switches

Approx. CPU Overhead
• Gigabit Ethernet: 80%
• Myrinet: 6%
• InfiniBand: 3%

Switch Latency
• Gigabit Ethernet: 10,000 ns
• Myrinet: 200 ns
• InfiniBand: 160 ns

Line Speed / Bandwidth
• Gigabit Ethernet: 10 Gbps (up to 40 Gbps planned)
• Myrinet: 10 Gbps
• InfiniBand: 1X (4 pins) - 2.5 Gbps; 4X (16 pins) - 10 Gbps; 12X (48 pins) - 30 Gbps

Cluster Applications
• Gigabit Ethernet: Coarse-grained (loosely coupled) parallel applications; some medium-grained parallel applications
• Myrinet: Fine (tightly coupled) to medium-grained parallel applications
• InfiniBand: Fine (tightly coupled) to medium-grained parallel applications

MPI Broadcast Comparison [figure panels: InfiniBand, Myrinet]

Questions?
