Extreme-Scale Interconnects and Impact on Applications
Pavan Balaji
Computer Scientist and Group Lead
Programming Models and Runtime Systems Group
Mathematics and Computer Science Division
Argonne National Laboratory
ATPESC Workshop (08/01/2016)
U.S. DOE Potential System Architecture Targets
System attributes          | 2010      | 2017-2018         | 2023-2024
---------------------------|-----------|-------------------|----------------------
System peak                | 2 Pflop/s | 150-200 Pflop/s   | 1 Eflop/s
Power                      | 6 MW      | 15 MW             | 20 MW
System memory              | 0.3 PB    | 5 PB              | 32-64 PB
Node performance           | 125 GF    | 3 TF / 30 TF      | 10 TF / 100 TF
Node memory BW             | 25 GB/s   | 0.1 TB/s / 1 TB/s | 0.4 TB/s / 4 TB/s
Node concurrency           | 12        | O(100) / O(1,000) | O(1,000) / O(10,000)
System size (nodes)        | 18,700    | 50,000 / 5,000    | 100,000 / 10,000
Total node interconnect BW | 1.5 GB/s  | 20 GB/s           | 200 GB/s
MTTI                       | days      | O(1 day)          | O(1 day)

2010 = current production; 2017-2018 = planned upgrades (e.g., CORAL); 2023-2024 = exascale goals. Where two values are given, they correspond to two design points, many smaller nodes vs. fewer larger nodes (e.g., 3 TF x 50,000 nodes and 30 TF x 5,000 nodes both reach 150 Pflop/s).
[Includes some modifications to the DOE Exascale report]
General Trends in System Architecture
§ Number of nodes is increasing, but at a moderate pace
§ Number of cores/threads on a node is increasing rapidly
§ Each core is not increasing in speed (clock frequency)
§ Chip logic complexity is decreasing (in-order instructions, no pipelining, no branch prediction)
§ What does this mean for networks?
– More cores will drive the network
– More sharing of the network infrastructure
– The aggregate amount of communication from each node will increase moderately, but will be divided into many smaller messages
– A single core will not be able to drive the network fully (see the sketch after this list)
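A minimal sketch of what "more cores driving the network" looks like in software, assuming an MPI library that supports MPI_THREAD_MULTIPLE and exactly two ranks; the buffer size and per-thread tag scheme are illustrative:

#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int provided, rank;
    /* Ask for full thread support so every core can issue MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (provided == MPI_THREAD_MULTIPLE) {
        #pragma omp parallel
        {
            char buf[4096] = {0};
            int peer = 1 - rank;              /* assumes exactly 2 ranks */
            int tag  = omp_get_thread_num();  /* one message stream per thread */
            if (rank == 0)
                MPI_Send(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            else
                MPI_Recv(buf, sizeof buf, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
    }
    MPI_Finalize();
    return 0;
}

With one communicating thread per core, the aggregate injection rate can approach what the adapter supports even though no single core can saturate the link on its own.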
A Simplified Network Architecture
§ Hardware components
– Processing cores and memory subsystem
– I/O bus or links
– Network adapters/switches
§ Software components
– Communication stack
§ A balanced approach is required to maximize user-perceived network performance
[Figure: a simplified network architecture — two processors (P0, P1, four cores each) and memory on a memory bus, a network adapter on the I/O bus, and a network switch; annotated with four bottleneck classes: processing, I/O interface, network end-host, and network fabric]
Agenda
Network Adapters
Network Topologies
Network/Processor/Memory Interactions
Bottlenecks on Traditional Network Adapters
§ Network speeds saturated at around 1 Gbps
– Features provided were limited
– Commodity networks were not considered scalable enough for very large-scale systems
[Figure: the same node architecture, with the network adapter and switch highlighted as the bottleneck]
Ethernet (1979-)          10 Mbit/s
Fast Ethernet (1993-)     100 Mbit/s
Gigabit Ethernet (1995-)  1000 Mbit/s
ATM (1995-)               155/622/1024 Mbit/s
Myrinet (1993-)           1 Gbit/s
Fibre Channel (1994-)     1 Gbit/s
End-host Network Interface Speeds
§ Recent network technologies provide high-bandwidth links
– InfiniBand EDR gives 100 Gbps per network link
• Upcoming networks are expected to increase that severalfold
– Multiple network links are becoming commonplace
• ORNL Summit and LLNL Sierra machines, Japanese Post-T2K machine
• Torus-style or other multi-dimensional networks
§ End-host network bandwidth is "mostly" no longer considered a major technological limitation
§ Network latency is still an issue
– That's a harder problem to solve: limited by physics, not technology (see the bound sketched after this list)
• There is some room to improve it in current technology (trimming the fat)
• Significant effort is going into making systems denser so as to reduce network latency
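A back-of-the-envelope version of the "limited by physics" argument, assuming a signal speed in optical fiber of roughly two-thirds of c and an illustrative 30 m cable run:

\[
v \approx \tfrac{2}{3}\,c \approx 2\times10^{8}\ \mathrm{m/s}
\qquad\Rightarrow\qquad
t_{\mathrm{prop}} = \frac{L}{v} \approx \frac{30\ \mathrm{m}}{2\times10^{8}\ \mathrm{m/s}} = 150\ \mathrm{ns}
\]

No protocol improvement recovers that propagation term, which is why shrinking L through denser packaging is one of the few remaining levers.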
§ Other important metrics: message rate, congestion, …
Simple Network Architecture (past systems)
§ Processor, memory, and network are all decoupled
[Figure: past system — processors and memory on the memory bus, with the network adapter hanging off a separate I/O bus and connecting to the network switch]
Integrated Memory Controllers (current systems)
§ In the past 10 years or so, memory controllers have been integrated onto the processor
§ The primary purpose was scalable memory bandwidth (NUMA)
§ This also helps network communication
– Data transfer to/from the network requires coordination with caches
§ Several network I/O technologies exist
– PCIe, HTX, NVLink
– Expected to provide higher bandwidth than what network links will have
[Figure: current system — P0 and P1 each with an integrated memory controller and local memory, with the network adapter still attached over the I/O bus]
Integrated Network Adapters (future systems)
[Figure: future system — P0 and P1 each integrate a memory controller and a network interface, with in-package memory and off-chip memory, connecting directly to the network switch]
§ Several vendors are considering processor-integrated network adapters
§ May improve network bandwidth
– Unclear if the I/O bus would be a bottleneck
§ Improves network latencies
– Control messages between the processor, network, and memory are now on-chip
§ Improves network functionality
– Communication is a first-class citizen and better integrated with processor features
– E.g., network atomic operations can be atomic with respect to processor atomics
Processing Bottlenecks in Traditional Protocols
§ Ex: TCP/IP, UDP/IP
§ Generic architecture for all networks
§ The host processor handles almost all aspects of communication
– Data buffering (copies on sender and receiver)
– Data integrity (checksum)
– Routing aspects (IP routing)
§ Signaling between different layers
– Hardware interrupt on packet arrival or transmission
– Software signals between different layers to handle protocol processing at different priority levels
[Figure: the same node architecture, with the host processing path highlighted as the bottleneck]
Network Protocol Stacks: The Offload Era
§ Modern networks are spending more and more network real estate on offloading various communication features onto hardware
§ The network and transport layers are hardware-offloaded on most modern networks
– Reliability (retransmissions, CRC checks), packetization
– OS-based memory registration and user-level data transmission
Comparing Offloaded Network Stacks with Traditional Network Stacks
Layer             | Offloaded network stack                          | Traditional Ethernet
------------------|--------------------------------------------------|--------------------------------------------------
Application Layer | MPI, PGAS, file systems (low-level interface)    | HTTP, FTP, MPI, file systems (sockets interface)
Transport Layer   | Reliable/unreliable protocols [hardware offload] | TCP, UDP
Network Layer     | Routing [hardware offload]                       | Routing (DNS, management tools)
Link Layer        | Flow control, error detection [hardware offload] | Flow control and error detection
Physical Layer    | Copper or optical                                | Copper, optical, or wireless

In the offloaded stack, the transport, network, and link layers are implemented in hardware on the adapter rather than on the host processor.
Current State for Network APIs
§ A large number of network vendor-specific APIs
– InfiniBand verbs, Mellanox MXM, IBM PAMI, Cray Gemini/DMAPP, …
§ Recent efforts to standardize these low-level communication APIs (a discovery sketch follows this list)
– OpenFabrics Interface (OFI)
• Effort from Intel, Cisco, etc., to provide a unified low-level communication layer that exposes the features provided by each network
– Unified Communication X (UCX)
• Effort from Mellanox, IBM, ORNL, etc., to provide a unified low-level communication layer that allows for efficient MPI and PGAS communication
– Portals 4
• Effort from Sandia National Laboratories to provide a network-hardware-capability-centric API
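A minimal sketch of how an application discovers a provider through OFI (libfabric); the endpoint type and capability bits requested here are illustrative choices, not a recommendation:

#include <rdma/fabric.h>
#include <stdio.h>

int main(void) {
    struct fi_info *hints = fi_allocinfo(), *info = NULL;
    hints->ep_attr->type = FI_EP_RDM;   /* reliable, connectionless endpoint */
    hints->caps = FI_MSG | FI_RMA;      /* send/recv plus PUT/GET */

    /* Ask libfabric which installed providers can satisfy the hints. */
    if (fi_getinfo(FI_VERSION(1, 3), NULL, NULL, 0, hints, &info) == 0) {
        for (struct fi_info *p = info; p != NULL; p = p->next)
            printf("provider: %s\n", p->fabric_attr->prov_name);
        fi_freeinfo(info);
    }
    fi_freeinfo(hints);
    return 0;
}

The same program can run over InfiniBand, Cray, or sockets providers; the unified API is what each of these standardization efforts is after.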
User-level Communication: Memory Registration
Before we do any communication, all memory used for communication must be registered:
1. Registration request
• Send virtual address and length
2. Kernel handles the virtual-to-physical mapping and pins the region into physical memory
• A process cannot map memory that it does not own (security!)
3. Network adapter caches the virtual-to-physical mapping and issues a handle
4. Handle is returned to the application
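As a concrete example, here is roughly what these steps look like through InfiniBand verbs; the helper below is a sketch and assumes an already-opened protection domain pd:

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Step 1: hand the virtual address and length to the stack. */
struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len) {
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    /* Steps 2-3 happen inside this call: the kernel pins the pages and
     * the adapter caches the virtual-to-physical mapping. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    /* Step 4: mr (with mr->lkey / mr->rkey) is the handle the
     * application uses in subsequent communication calls. */
    return mr;
}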
Send/Receive Communication
[Figure: send/receive communication — each side has a processor, memory with registered segments, and a network interface with send and receive queues; the receiver's adapter returns a hardware ACK]
The send entry contains information about the send buffer (multiple non-contiguous segments). The receive entry contains information about the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive entry to know where to place the data.
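The "multiple non-contiguous segments" above are expressed as scatter/gather lists at the API level. A sketch in InfiniBand verbs, assuming a created queue pair qp and two registered regions mr1 and mr2:

#include <infiniband/verbs.h>
#include <stdint.h>

int post_two_segment_recv(struct ibv_qp *qp,
                          struct ibv_mr *mr1, struct ibv_mr *mr2) {
    /* Receive entry: two non-contiguous segments for one message. */
    struct ibv_sge sgl[2] = {
        { .addr = (uintptr_t)mr1->addr, .length = (uint32_t)mr1->length,
          .lkey = mr1->lkey },
        { .addr = (uintptr_t)mr2->addr, .length = (uint32_t)mr2->length,
          .lkey = mr2->lkey },
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = sgl, .num_sge = 2 };
    struct ibv_recv_wr *bad = NULL;
    /* The adapter matches an incoming send to this entry and scatters
     * the payload across the two segments. */
    return ibv_post_recv(qp, &wr, &bad);
}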
PUT/GET Communication
[Figure: PUT/GET communication — the network interface moves data directly between a local memory segment and a remote memory segment; the target returns a hardware ACK]
The send entry contains information about the send buffer (multiple segments) and the receive buffer (single segment).
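For reference, a minimal sketch of the software analogue in MPI one-sided communication; the window win and the displacement of 0 are assumptions about setup done elsewhere:

#include <mpi.h>

void put_example(const double *local, int count, int target, MPI_Win win) {
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* The origin names both the send buffer (local, count) and the
     * single remote segment (target rank, displacement 0, count). */
    MPI_Put(local, count, MPI_DOUBLE, target,
            0 /* target_disp */, count, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);   /* completes the transfer */
}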
Atomic Operations
[Figure: atomic operation — the network interface applies an operation (OP) to a destination memory segment at the target using data from a source memory segment at the origin]
The send entry contains information about the send buffer and the receive buffer.
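A sketch of issuing such an atomic from MPI; on networks that offload atomics, a call like this can map directly to the hardware operation shown above. The window and displacement are assumptions:

#include <mpi.h>

long fetch_and_add(MPI_Win win, int target, long increment) {
    long previous;
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* Atomically add `increment` to the destination segment and
     * return the value it held before the update. */
    MPI_Fetch_and_op(&increment, &previous, MPI_LONG, target,
                     0 /* target_disp */, MPI_SUM, win);
    MPI_Win_unlock(target, win);
    return previous;
}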
Network Protocol Stacks: Specialization
§ Increasing network specialization is the focus today
– The next generation of networks plans to have further support for noncontiguous data movement, and multiple contexts for multithreaded architectures
§ Some networks, such as the Blue Gene network, the Cray network, and InfiniBand, are also offloading some MPI and PGAS features onto hardware
– E.g., PUT/GET communication has hardware support
– An increasing number of atomic operations are being offloaded to hardware
• Compare-and-swap, fetch-and-add, swap
– Collective operations
– Portals-based networks also had support for hardware matching of MPI send/recv
• Cray SeaStar, Bull BXI
Agenda
Network Adapters
Network Topologies
Network/Processor/Memory Interactions
Traditional Network Topologies: Crossbar
§ A network topology describes how different network adapters and switches are interconnected with each other
§ The most ideal network topology (for performance) is a crossbar
– All-to-all connection between network adapters
– Typically done on a single network ASIC
– Current network crossbar ASICs go up to 64 ports; too expensive to scale to higher port counts
– All communication is nonblocking
Traditional Network Topologies: Fat-tree
§ The most common topology for small- and medium-scale systems is a fat-tree
– Nonblocking fat-tree switches are available in abundance
• Allows for pseudo-nonblocking communication
• Between all pairs of processes there exists a completely nonblocking path, but not all paths are nonblocking
– More scalable than crossbars, but the number of network links still increases super-linearly with node count (a rough count follows this list)
• Can get very expensive with scale
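A rough count behind the super-linear claim, assuming an idealized full fat-tree built from k-port switches (k/2 ports up and k/2 down at each level):

\[
\text{levels} \approx \log_{k/2} N,
\qquad
\text{total links} \approx N \cdot \log_{k/2} N .
\]

With k = 64, roughly 2,048 nodes fit in two levels (about 2 links per node) and 65,536 nodes need three (about 3 links per node), so the link count, and with it the cost, grows faster than the node count.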
Network Topology Trends
§ Modern topologies are moving towards more "scalability" (with respect to cost, not performance)
§ Blue Gene, Cray XE/XK, and the K supercomputer use a torus network; Cray XC uses a dragonfly
– Linear increase in the number of links/routers with system size
– Any communication that is more than one hop away has a possibility of interference: congestion is not just possible, but common
– Even when there is no congestion, such topologies increase the network diameter, causing performance loss
§ Take-away: topological locality is important, and it is not going to get better
Network Congestion Behavior: IBM BG/P
[Figure: bandwidth (Mbps) vs. message size (bytes) for eight processes P0-P7 placed along a line; curves for the concurrent P2-P5 and P3-P4 pairs vs. a no-overlap baseline, showing lower bandwidth when the paths share links]
2D Nearest Neighbor: Process Mapping (XYZ)
[Figure: mapping of the 2D nearest-neighbor process grid onto the X, Y, and Z axes of the torus]
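How an application typically expresses this mapping problem: a sketch using MPI's Cartesian topology interface, where reorder = 1 gives the library permission to place ranks in a topology-aware order (such as the XYZT-style mappings compared on the next slide):

#include <mpi.h>

/* Build a periodic 2D process grid and let MPI reorder ranks to
 * match the physical network topology. */
void make_2d_grid(MPI_Comm *cart) {
    int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1};
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);   /* factor nprocs into a 2D grid */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods,
                    1 /* reorder */, cart);
}

MPI_Cart_shift on the resulting communicator then yields the neighbor ranks for the halo exchange.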
Nearest Neighbor Performance: IBM BG/P
[Figure: 2D halo exchange execution time (us) vs. grid partition size (2 bytes to 1 KB) on IBM BG/P for the process mappings XYZT, TXYZ, ZYXT, and TZYX; two panels for system sizes of 16K cores and 128K cores]
Agenda
Network Adapters
Network Topologies
Network/Processor/Memory Interactions
Network Interactions with Memory/Cache
§ Most network interfaces understand and work with the cache coherence protocols available on modern systems
– Users do not have to ensure that data is flushed from cache before communication
– The network and memory controller hardware understand what state the data is in and communicate appropriately
Send-side Network Communication
[Figure: send-side data path — the application buffer moves from memory (or the L3 cache) across the memory bus through the north bridge and over the I/O bus to the NIC]
Receive-side Network Communication
[Figure: receive-side data path — the NIC delivers incoming data over the I/O bus and through the north bridge into the application buffer in memory]
Network/Processor Interoperation Trends
§ Direct cache injection
– Most current networks inject data into memory
• If the data is in cache, they flush the cache and then inject into memory
– Some networks are investigating direct cache injection
• Data can be injected directly into the last-level cache
• Can be tricky, since it can cause cache pollution if the incoming data is not used immediately
§ Atomic operations
– Current network atomic operations are only atomic with respect to other network operations, and not with respect to processor atomics
• E.g., a network fetch-and-add and a processor fetch-and-add might corrupt each other's data (illustrated after this list)
– With network/processor integration, this is expected to be fixed
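A hypothetical illustration of the hazard, not working code for any particular network: two update paths hit the same counter, one through processor atomics and one through a network atomic. On hardware where the adapter does not participate in the coherence protocol, the two fetch-and-adds are not atomic with respect to each other:

#include <mpi.h>
#include <stdatomic.h>

/* `counter` is assumed to live in the memory attached to `win`. */
void mixed_updates(MPI_Win win, _Atomic long *counter, int target) {
    long one = 1, old;

    /* Path 1: processor fetch-and-add on the local copy. */
    atomic_fetch_add(counter, 1);

    /* Path 2: network fetch-and-add on the same location via MPI.
     * Safe only if network atomics are coherent with processor
     * atomics -- the integration benefit described above. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Fetch_and_op(&one, &old, MPI_LONG, target, 0, MPI_SUM, win);
    MPI_Win_unlock(target, win);
}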
Summary
§ With every 10X increase in performance, something breaks!
§ In the past 20 years, we have enjoyed a million-fold increase in performance
– Patchwork at every step is not going to cut it at this pace
– We need to look forward to what is next in technology and think about how to utilize it
§ We are looking at another 5-6X increase in performance over the next 6-8 years
§ These are interesting times for all components in the overall system architecture: processor, memory, and interconnect
– And interesting times for computational science on these systems!