Extreme-Scale Interconnects and Impact on Applications
Pavan Balaji
Computer Scientist and Group Lead
Programming Models and Runtime Systems Group
Mathematics and Computer Science Division
Argonne National Laboratory
ATPESC Workshop (08/01/2016)
U.S. DOE Potential System Architecture Targets
System attributes          | 2010      | 2017-2018         | 2023-2024
---------------------------|-----------|-------------------|----------------------
System peak                | 2 Pflop/s | 150-200 Pflop/s   | 1 Eflop/s
Power                      | 6 MW      | 15 MW             | 20 MW
System memory              | 0.3 PB    | 5 PB              | 32-64 PB
Node performance           | 125 GF    | 3 TF / 30 TF      | 10 TF / 100 TF
Node memory BW             | 25 GB/s   | 0.1 TB/s / 1 TB/s | 0.4 TB/s / 4 TB/s
Node concurrency           | 12        | O(100) / O(1,000) | O(1,000) / O(10,000)
System size (nodes)        | 18,700    | 50,000 / 5,000    | 100,000 / 10,000
Total node interconnect BW | 1.5 GB/s  | 20 GB/s           | 200 GB/s
MTTI                       | days      | O(1 day)          | O(1 day)

2010 = current production; 2017-2018 = planned upgrades (e.g., CORAL); 2023-2024 = exascale goals. Where two values are given, they correspond to two design points, many smaller nodes vs. fewer larger nodes (e.g., 3 TF x 50,000 nodes and 30 TF x 5,000 nodes both reach 150 Pflop/s).
[Includes some modifications to the DOE Exascale report]
General Trends in System Architecture
§ Number of nodes is increasing, but at a moderate pace
§ Number of cores/threads on a node is increasing rapidly
§ Each core is not increasing in speed (clock frequency)
§ Chip logic complexity is decreasing (in-order instructions, no pipelining, no branch prediction)
§ What does this mean for networks?
– More cores will drive the network
– More sharing of the network infrastructure
– The aggregate amount of communication from each node will increase moderately, but will be divided into many smaller messages
– A single core will not be able to drive the network fully (see the sketch after this list)
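A minimal sketch of what "more cores driving the network" looks like in software, assuming an MPI library that supports MPI_THREAD_MULTIPLE and exactly two ranks; the buffer size and per-thread tag scheme are illustrative:

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Ask for full thread support so every core can issue MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided == MPI_THREAD_MULTIPLE) {
        #pragma omp parallel
        {
            char buf[4096] = {0};
            int peer = 1 - rank;              /* assumes exactly 2 ranks */
            int tag  = omp_get_thread_num();  /* one message stream per thread */
            if (rank == 0)
                MPI_Send(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}

With one communicating thread per core, the aggregate injection rate can approach what the adapter supports even though no single core can saturate the link on its own.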
A Simplified Network Architecture
§ Hardware components
– Processing cores and memory subsystem
– I/O bus or links
– Network adapters/switches
§ Software components
– Communication stack
§ A balanced approach is required to maximize user-perceived network performance
[Figure: a simplified network architecture — two processors (P0, P1, four cores each) and memory on a memory bus, a network adapter on the I/O bus, and a network switch; annotated with four bottleneck classes: processing, I/O interface, network end-host, and network fabric]
Agenda
Network Adapters
Network Topologies
Network/Processor/Memory Interactions
Bottlenecks on Traditional Network Adapters
§ Network speeds saturated at around 1 Gbps
– Features provided were limited
– Commodity networks were not considered scalable enough for very large-scale systems
[Figure: the same node architecture, with the network adapter and switch highlighted as the bottleneck]
Ethernet (1979-)          10 Mbit/s
Fast Ethernet (1993-)     100 Mbit/s
Gigabit Ethernet (1995-)  1000 Mbit/s
ATM (1995-)               155/622/1024 Mbit/s
Myrinet (1993-)           1 Gbit/s
Fibre Channel (1994-)     1 Gbit/s
End-host Network Interface Speeds
§ Recent network technologies provide high-bandwidth links
– InfiniBand EDR gives 100 Gbps per network link
• Upcoming networks are expected to increase that severalfold
– Multiple network links are becoming commonplace
• ORNL Summit and LLNL Sierra machines, Japanese Post-T2K machine
• Torus-style or other multi-dimensional networks
§ End-host network bandwidth is "mostly" no longer considered a major technological limitation
§ Network latency is still an issue
– That's a harder problem to solve: limited by physics, not technology (see the bound sketched after this list)
• There is some room to improve it in current technology (trimming the fat)
• Significant effort is going into making systems denser so as to reduce network latency
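A back-of-the-envelope version of the "limited by physics" argument, assuming a signal speed in optical fiber of roughly two-thirds of c and an illustrative 30 m cable run:

\[
v \approx \tfrac{2}{3}\,c \approx 2\times10^{8}\ \mathrm{m/s}
\qquad\Rightarrow\qquad
t_{\mathrm{prop}} = \frac{L}{v} \approx \frac{30\ \mathrm{m}}{2\times10^{8}\ \mathrm{m/s}} = 150\ \mathrm{ns}
\]

No protocol improvement recovers that propagation term, which is why shrinking L through denser packaging is one of the few remaining levers.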
§ Other important metrics: message rate, congestion, …
Simple Network Architecture (past systems)
§ Processor, memory, and network are all decoupled
[Figure: past system — processors and memory on the memory bus, with the network adapter hanging off a separate I/O bus and connecting to the network switch]
Integrated Memory Controllers (current systems)
§ In the past 10 years or so, memory controllers have been integrated onto the processor
§ The primary purpose was scalable memory bandwidth (NUMA)
§ This also helps network communication
– Data transfer to/from the network requires coordination with caches
§ Several network I/O technologies exist
– PCIe, HTX, NVLink
– Expected to provide higher bandwidth than what network links will have
[Figure: current system — P0 and P1 each with an integrated memory controller and local memory, with the network adapter still attached over the I/O bus]
Integrated Network Adapters (future systems)
[Figure: future system — P0 and P1 each integrate a memory controller and a network interface, with in-package memory and off-chip memory, connecting directly to the network switch]
§ Several vendors are considering processor-integrated network adapters
§ May improve network bandwidth
– Unclear if the I/O bus would be a bottleneck
§ Improves network latencies
– Control messages between the processor, network, and memory are now on-chip
§ Improves network functionality
– Communication is a first-class citizen and better integrated with processor features
– E.g., network atomic operations can be atomic with respect to processor atomics
Processing Bottlenecks in Traditional Protocols
§ Ex: TCP/IP, UDP/IP
§ Generic architecture for all networks
§ The host processor handles almost all aspects of communication
– Data buffering (copies on sender and receiver)
– Data integrity (checksum)
– Routing aspects (IP routing)
§ Signaling between different layers
– Hardware interrupt on packet arrival or transmission
– Software signals between different layers to handle protocol processing at different priority levels
[Figure: the same node architecture, with the host processing path highlighted as the bottleneck]
Network Protocol Stacks: The Offload Era
§ Modern networks are spending more and more network real estate on offloading various communication features onto hardware
§ The network and transport layers are hardware-offloaded on most modern networks
– Reliability (retransmissions, CRC checks), packetization
– OS-based memory registration and user-level data transmission
Comparing Offloaded Network Stacks with Traditional Network Stacks
Layer             | Offloaded network stack                          | Traditional Ethernet
------------------|--------------------------------------------------|--------------------------------------------------
Application Layer | MPI, PGAS, file systems (low-level interface)    | HTTP, FTP, MPI, file systems (sockets interface)
Transport Layer   | Reliable/unreliable protocols [hardware offload] | TCP, UDP
Network Layer     | Routing [hardware offload]                       | Routing (DNS, management tools)
Link Layer        | Flow control, error detection [hardware offload] | Flow control and error detection
Physical Layer    | Copper or optical                                | Copper, optical, or wireless

In the offloaded stack, the transport, network, and link layers are implemented in hardware on the adapter rather than on the host processor.
Current State for Network APIs
§ A large number of network vendor-specific APIs
– InfiniBand verbs, Mellanox MXM, IBM PAMI, Cray Gemini/DMAPP, …
§ Recent efforts to standardize these low-level communication APIs (a discovery sketch follows this list)
– OpenFabrics Interface (OFI)
• Effort from Intel, Cisco, etc., to provide a unified low-level communication layer that exposes the features provided by each network
– Unified Communication X (UCX)
• Effort from Mellanox, IBM, ORNL, etc., to provide a unified low-level communication layer that allows for efficient MPI and PGAS communication
– Portals 4
• Effort from Sandia National Laboratories to provide a network-hardware-capability-centric API
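A minimal sketch of how an application discovers a provider through OFI (libfabric); the endpoint type and capability bits requested here are illustrative choices, not a recommendation:

#include <rdma/fabric.h>
#include <stdio.h>

int main(void) {
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    hints->ep_attr->type = FI_EP_RDM;   /* reliable, connectionless endpoint */
    hints->caps = FI_MSG | FI_RMA;      /* send/recv plus PUT/GET */

    /* Ask libfabric which installed providers can satisfy the hints. */
    if (fi_getinfo(FI_VERSION(1, 3), NULL, NULL, 0, hints, &info) == 0) {
        for (struct fi_info *p = info; p != NULL; p = p->next)
            printf("provider: %s\n", p->fabric_attr->prov_name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}

The same program can run over InfiniBand, Cray, or sockets providers; the unified API is what each of these standardization efforts is after.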
User-level Communication: Memory Registration
Before we do any communication, all memory used for communication must be registered:
1. Registration request
• Send virtual address and length
2. Kernel handles the virtual-to-physical mapping and pins the region into physical memory
• A process cannot map memory that it does not own (security!)
3. Network adapter caches the virtual-to-physical mapping and issues a handle
4. Handle is returned to the application
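As a concrete example, here is roughly what these steps look like through InfiniBand verbs; the helper below is a sketch and assumes an already-opened protection domain pd:

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Step 1: hand the virtual address and length to the stack. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len) {
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    /* Steps 2-3 happen inside this call: the kernel pins the pages and
     * the adapter caches the virtual-to-physical mapping. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* Step 4: mr (with mr->lkey / mr->rkey) is the handle the
     * application uses in subsequent communication calls. */
    return mr;
}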
Send/Receive Communication
[Figure: send/receive communication — each side has a processor, memory with registered segments, and a network interface with send and receive queues; the receiver's adapter returns a hardware ACK]
The send entry contains information about the send buffer (multiple non-contiguous segments). The receive entry contains information about the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive entry to know where to place the data.
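The "multiple non-contiguous segments" above are expressed as scatter/gather lists at the API level. A sketch in InfiniBand verbs, assuming a created queue pair qp and two registered regions mr1 and mr2:

#include <infiniband/verbs.h>
#include <stdint.h>

int post_two_segment_recv(struct ibv_qp *qp,
                          struct ibv_mr *mr1, struct ibv_mr *mr2) {
    /* Receive entry: two non-contiguous segments for one message. */
    struct ibv_sge sgl[2] = {
        { .addr = (uintptr_t)mr1->addr, .length = (uint32_t)mr1->length,
          .lkey = mr1->lkey },
        { .addr = (uintptr_t)mr2->addr, .length = (uint32_t)mr2->length,
          .lkey = mr2->lkey },
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = sgl, .num_sge = 2 };
    struct ibv_recv_wr *bad = NULL;
    /* The adapter matches an incoming send to this entry and scatters
     * the payload across the two segments. */
    return ibv_post_recv(qp, &wr, &bad);
}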
PUT/GET Communication
[Figure: PUT/GET communication — the network interface moves data directly between a local memory segment and a remote memory segment; the target returns a hardware ACK]
The send entry contains information about the send buffer (multiple segments) and the receive buffer (single segment).
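For reference, a minimal sketch of the software analogue in MPI one-sided communication; the window win and the displacement of 0 are assumptions about setup done elsewhere:

#include <mpi.h>

void put_example(const double *local, int count, int target, MPI_Win win) {
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* The origin names both the send buffer (local, count) and the
     * single remote segment (target rank, displacement 0, count). */
    MPI_Put(local, count, MPI_DOUBLE, target,
            0 /* target_disp */, count, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);   /* completes the transfer */
}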
Atomic Operations
[Figure: atomic operation — the network interface applies an operation (OP) to a destination memory segment at the target using data from a source memory segment at the origin]
The send entry contains information about the send buffer and the receive buffer.
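A sketch of issuing such an atomic from MPI; on networks that offload atomics, a call like this can map directly to the hardware operation shown above. The window and displacement are assumptions:

#include <mpi.h>

long fetch_and_add(MPI_Win win, int target, long increment) {
    long previous;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* Atomically add `increment` to the destination segment and
     * return the value it held before the update. */
    MPI_Fetch_and_op(&increment, &previous, MPI_LONG, target,
                     0 /* target_disp */, MPI_SUM, win);
    MPI_Win_unlock(target, win);
    return previous;
}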
Network Protocol Stacks: Specialization
§ Increasing network specialization is the focus today
– The next generation of networks plans to have further support for noncontiguous data movement, and multiple contexts for multithreaded architectures
§ Some networks, such as the Blue Gene network, the Cray network, and InfiniBand, are also offloading some MPI and PGAS features onto hardware
– E.g., PUT/GET communication has hardware support
– An increasing number of atomic operations are being offloaded to hardware
• Compare-and-swap, fetch-and-add, swap
– Collective operations
– Portals-based networks also had support for hardware matching of MPI send/recv
• Cray SeaStar, Bull BXI
Agenda
Network Adapters
Network Topologies
Network/Processor/Memory Interactions
Traditional Network Topologies: Crossbar
§ A network topology describes how different network adapters and switches are interconnected with each other
§ The most ideal network topology (for performance) is a crossbar
– All-to-all connection between network adapters
– Typically done on a single network ASIC
– Current network crossbar ASICs go up to 64 ports; too expensive to scale to higher port counts
– All communication is nonblocking
Traditional Network Topologies: Fat-tree
§ The most common topology for small- and medium-scale systems is a fat-tree
– Nonblocking fat-tree switches are available in abundance
• Allows for pseudo-nonblocking communication
• Between all pairs of processes there exists a completely nonblocking path, but not all paths are nonblocking
– More scalable than crossbars, but the number of network links still increases super-linearly with node count (a rough count follows this list)
• Can get very expensive with scale
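A rough count behind the super-linear claim, assuming an idealized full fat-tree built from k-port switches (k/2 ports up and k/2 down at each level):

\[
\text{levels} \approx \log_{k/2} N,
\qquad
\text{total links} \approx N \cdot \log_{k/2} N .
\]

With k = 64, roughly 2,048 nodes fit in two levels (about 2 links per node) and 65,536 nodes need three (about 3 links per node), so the link count, and with it the cost, grows faster than the node count.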
Network Topology Trends
§ Modern topologies are moving towards more "scalability" (with respect to cost, not performance)
§ Blue Gene, Cray XE/XK, and the K supercomputer use a torus network; Cray XC uses a dragonfly
– Linear increase in the number of links/routers with system size
– Any communication that is more than one hop away has a possibility of interference: congestion is not just possible, but common
– Even when there is no congestion, such topologies increase the network diameter, causing performance loss
§ Take-away: topological locality is important, and it is not going to get better
Network Congestion Behavior: IBM BG/P
[Figure: bandwidth (Mbps) vs. message size (bytes) for eight processes P0-P7 placed along a line; curves for the concurrent P2-P5 and P3-P4 pairs vs. a no-overlap baseline, showing lower bandwidth when the paths share links]
2D Nearest Neighbor: Process Mapping (XYZ)
[Figure: mapping of the 2D nearest-neighbor process grid onto the X, Y, and Z axes of the torus]
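How an application typically expresses this mapping problem: a sketch using MPI's Cartesian topology interface, where reorder = 1 gives the library permission to place ranks in a topology-aware order (such as the XYZT-style mappings compared on the next slide):

#include <mpi.h>

/* Build a periodic 2D process grid and let MPI reorder ranks to
 * match the physical network topology. */
void make_2d_grid(MPI_Comm *cart) {
    int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);   /* factor nprocs into a 2D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reorder */, cart);
}

MPI_Cart_shift on the resulting communicator then yields the neighbor ranks for the halo exchange.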
Nearest Neighbor Performance: IBM BG/P
[Figure: 2D halo exchange execution time (us) vs. grid partition size (2 bytes to 1 KB) on IBM BG/P for the process mappings XYZT, TXYZ, ZYXT, and TZYX; two panels for system sizes of 16K cores and 128K cores]
Agenda
Network Adapters
Network Topologies
Network/Processor/Memory Interactions
Network Interactions with Memory/Cache
§ Most network interfaces understand and work with the cache coherence protocols available on modern systems
– Users do not have to ensure that data is flushed from cache before communication
– The network and memory controller hardware understand what state the data is in and communicate appropriately
Send-side Network Communication
[Figure: send-side data path — the application buffer moves from memory (or the L3 cache) across the memory bus through the north bridge and over the I/O bus to the NIC]
Receive-side Network Communication
[Figure: receive-side data path — the NIC delivers incoming data over the I/O bus and through the north bridge into the application buffer in memory]
Network/Processor Interoperation Trends
§ Direct cache injection
– Most current networks inject data into memory
• If the data is in cache, they flush the cache and then inject into memory
– Some networks are investigating direct cache injection
• Data can be injected directly into the last-level cache
• Can be tricky, since it can cause cache pollution if the incoming data is not used immediately
§ Atomic operations
– Current network atomic operations are only atomic with respect to other network operations, and not with respect to processor atomics
• E.g., a network fetch-and-add and a processor fetch-and-add might corrupt each other's data (illustrated after this list)
– With network/processor integration, this is expected to be fixed
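A hypothetical illustration of the hazard, not working code for any particular network: two update paths hit the same counter, one through processor atomics and one through a network atomic. On hardware where the adapter does not participate in the coherence protocol, the two fetch-and-adds are not atomic with respect to each other:

#include <mpi.h>
#include <stdatomic.h>

/* `counter` is assumed to live in the memory attached to `win`. */
void mixed_updates(MPI_Win win, _Atomic long *counter, int target) {
    long one = 1, old;

    /* Path 1: processor fetch-and-add on the local copy. */
    atomic_fetch_add(counter, 1);

    /* Path 2: network fetch-and-add on the same location via MPI.
     * Safe only if network atomics are coherent with processor
     * atomics -- the integration benefit described above. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Fetch_and_op(&one, &old, MPI_LONG, target, 0, MPI_SUM, win);
    MPI_Win_unlock(target, win);
}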
Summary
§ With every 10X increase in performance, something breaks!
§ In the past 20 years, we have enjoyed a million-fold increase in performance
– Patchwork at every step is not going to cut it at this pace
– We need to look forward to what is next in technology and think about how to utilize it
§ We are looking at another 5-6X increase in performance over the next 6-8 years
§ These are interesting times for all components in the overall system architecture: processor, memory, and interconnect
– And interesting times for computational science on these systems!