
The opinions of the authors expressed in this document do not necessarily reflect the official opinion of the ExaFLOW partners nor of the European Commission.

H2020 FETHPC-1-2014

Enabling Exascale Fluid Dynamics Simulations
Project Number 671571

D2.1 – Initial Report on Exascale Technology State-of-the-Art

WP2: Efficiency Improvements Towards Exascale


Copyright © 2016 The ExaFLOW Consortium

Document Information

Deliverable Number: D2.1
Deliverable Name: Initial Report on Exascale Technology State-of-the-Art
Due Date: 31/05/2016 (PM8)
Deliverable Lead: UEDIN
Authors: N. Johnson, UEDIN
Responsible Author: N. Johnson, UEDIN
Keywords: Exascale, Technologies, Programming
WP: WP2
Nature: R
Dissemination Level: PU
Final Version Date: 31/05/2016
Reviewed by: Philipp Schlatter, KTH; Ulrich Rist, Univ. Stuttgart
MGT Board Approval: 31/05/2016


Document History

Partner | Date       | Comments                        | Version
UEDIN   | 29/04/2016 | First draft                     | 0.1
UEDIN   | 02/05/2016 | Second draft                    | 0.2
UEDIN   | 18/05/2016 | Integrate comments from USTUTT  | 0.5
UEDIN   | 26/05/2016 | Final edits                     | 0.6
KTH     | 31/05/2016 | Finalization after PMB review   | 1.0


Executive Summary

Current HPC technology is unable to scale to a sufficient degree to allow construction of a feasible exascale machine. Using current technology, such a machine would consume at least 7 times the 20 MW target power limit. If this limit is ignored, then it would be possible to construct such a machine. However, no job is likely to be able to make efficient use of a whole machine, for two reasons: faults and software scaling. Scaling current technology would lead to a high number of faults, i.e. cases where a processor fails in some manner that is unrecoverable by the hardware alone. The application would have to be robust to this, most likely by using fault tolerant methods, as checkpointing a computation on such a machine may prove prohibitively expensive.

Looking forward, emerging technology will develop to help reduce current bottlenecks: latencies will fall, both between memory and processor and between nodes; energy efficiency will increase, bringing the power target within reach; and programming models will evolve to either provide, or allow simpler construction of, scalable fault tolerant methods.

This deliverable examines some current technologies and projects working towards the exascale to see how a machine implementing them or their successors might look, and where the bottlenecks might lie in such a machine. For example, using die-stacked memory (where the memory and processor inhabit the same physical package, if not the same die) reduces the energy required to move data, along with the latency incurred. This, however, implies that off-package communication is relatively more expensive than at present, unless interconnect technology can keep pace. Consequently, algorithm design may have to focus more on communication avoidance to reduce the impact of these penalties.


Contents

1 Acronyms & Abbreviations
2 Introduction
3 Building an exascale system with current technology
4 Emergent technologies
  4.1 Die Stacked Memory
  4.2 High Bandwidth Memory
  4.3 Hybrid Memory Cube
  4.4 System on Chip processors
  4.5 Fault tolerant algorithms
  4.6 Scalable programming models
5 What would a system look like if it used all such technologies?
6 Conclusion and Future Work
7 Bibliography

1 Acronyms & Abbreviations

Acronym | Meaning
CAF     | Co-array Fortran
COTS    | Commercial Off The Shelf
CPU     | Central Processing Unit
DDR4    | Double Data Rate 4th Generation
DRAM    | Dynamic Random Access Memory
DSM     | Die Stacked Memory
ECC     | Error Correcting Code
Flop    | Floating Point Operation
FPGA    | Field Programmable Gate Array
FT      | Fault tolerant / fault tolerance
GPGPU   | General Purpose Graphics Processing Unit
HBM     | High Bandwidth Memory
HMC     | Hybrid Memory Cube
HPC     | High Performance Computing
L1/L2   | Level 1 / Level 2 Cache
LLC     | Last Level Cache
MPI     | Message Passing Interface
OpenMP  | Open Multi-Processing
PCIe    | Peripheral Component Interconnect Express
PE      | Processing Element
PGAS    | Partitioned Global Address Space
SoC     | System on Chip
TSV     | Through Silicon Via
UPC     | Unified Parallel C


2 Introduction

The purpose of this deliverable is to describe some of the current technology trends in HPC and project these trends forward to imagine what an exascale system based on these technologies would look like, what performance characteristics it might have, and what bottlenecks it may display.

3 Building an exascale system with current technology

In order to provide a meaningful comparison, a fictitious exascale system named Fictision, based upon current COTS technologies, is described. This allows us to consider how far current technology needs to advance to reach the exascale.

Fictision is based on the same technology as ARCHER (ARCHER, 2016), the UK National Facility operated by EPCC. ARCHER has 4920 nodes, each comprising two Intel Xeon CPUs. It has a theoretical maximum performance of 2.55 PFlop/s and a measured maximum of 1.64 PFlop/s; thus, it has a ratio of theoretical to measured performance of 1.55. It consumes 1,200 kW of power at full load, giving a delivered energy efficiency of 1.368 GFlop/W.

Based on these characteristics, if Fictision were to be built using this technology, then in order to achieve a measured performance of 1 EFlop/s it would require hardware capable of 1.55 EFlop/s. A single node has a theoretical performance of 518.4 GFlop/s¹ and, therefore, Fictision requires 2,989,970 nodes. Using ARCHER's power figures, Fictision requires 729 MW of power at full load.

As a comparison, Shoubu, the leader of the Green 500 list (a list of the most energy-efficient supercomputers), has a power efficiency of 7.031 GFlop/W. Using this efficiency, an exascale machine would draw 142 MW. Whilst this is less than Fictision, it is still greater than the 20 MW target, with a per-flop energy-efficiency deficit of 7x.

Given sufficient financial resources, it would be feasible to manufacture the hardware required to build an exascale machine. The same would be true for the cooling and space required to host such a machine, and for the electricity required to power it. What is less clear is whether it would be of practical use. Current codes are unlikely to be able to exploit the full width of the machine, i.e. they would not be able to scale sufficiently well to make efficient use of the available compute power. If a code could scale to the whole machine, it would likely see hardware and software faults during each run. Unless fault tolerance was implemented, either by the programming model, program or algorithm, this would likely result in a very low rate of successful job completion. More likely is that such a machine would be a capacity machine rather than a capability machine, running a selection of jobs of different sizes, much like current machines, except that no job would be run using all available resources.

¹ We assume each core runs fixed at its maximum speed of 2.7 GHz and executes 8 floating-point operations per cycle.
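To make the sizing arithmetic above easy to reproduce, the short Python sketch below recomputes the Fictision node count and power draw from the ARCHER and Shoubu figures quoted in this section. It is only a back-of-envelope illustration; the input values come from the text, everything else is arithmetic.

```python
# Back-of-envelope sizing of the fictitious exascale machine "Fictision",
# using the ARCHER and Shoubu figures quoted in Section 3 (illustrative only).
import math

archer_nodes = 4920
archer_peak_pflops = 2.55        # ARCHER theoretical peak, PFlop/s
archer_measured_pflops = 1.64    # ARCHER measured maximum, PFlop/s
archer_power_kw = 1200.0         # ARCHER full-load power draw, kW
node_peak_gflops = 518.4         # per-node theoretical peak, GFlop/s

# Ratio of theoretical to measured performance (~1.55, as rounded in the text).
peak_to_measured = round(archer_peak_pflops / archer_measured_pflops, 2)

# Theoretical peak needed to sustain 1 EFlop/s, and the node count it implies.
required_peak_gflops = 1.0e9 * peak_to_measured          # 1 EFlop/s = 1e9 GFlop/s
required_nodes = math.ceil(required_peak_gflops / node_peak_gflops)

# Power at full load, scaling ARCHER's per-node draw (~729 MW).
power_mw = required_nodes * (archer_power_kw / archer_nodes) / 1000.0

# Alternative: scale from Shoubu's Green 500 efficiency of 7.031 GFlop/W (~142 MW).
power_shoubu_mw = 1.0e18 / 7.031e9 / 1.0e6

print(f"nodes required:       {required_nodes:,}")
print(f"power (ARCHER tech):  {power_mw:.0f} MW")
print(f"power (Shoubu tech):  {power_shoubu_mw:.0f} MW")
```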

4 Emergent technologies

In this section, we consider some emerging technologies and how they will help to overcome some of the limitations of current technology, which will be required to reach a practical and usable exascale system.

4.1 Die Stacked Memory

Die Stacked Memory (DSM) is the process whereby Dynamic Random Access Memory (DRAM) is placed as close as possible to the processing elements (PEs) in a chip. There are some variations in technology: one where the DRAM is on the same die as the PE, with the interconnect via the substrate; and another where they are separate wafers packaged together, with a through-silicon via (TSV) as the interconnect. The former gives a higher bandwidth and lower latency but requires a design freeze earlier in the process, as the whole chip must be laid out as a single part. The latter allows for design changes to a single layer but has a (comparatively) lower bandwidth and higher latency, as communication still occurs as if between chips (as it technically is), although at an improved speed compared to current PE-to-DRAM links.

The implication is that the latency seen between PE and DRAM decreases and moves much closer to that between PE and last-level cache (LLC). The latency penalty for a data or instruction miss to DRAM, on either a read or a write, is much lower, which implies PEs should have a lower stall time. With a correspondingly higher memory bandwidth, memory-bound codes should perform better where the restriction is on local memory. Any off-chip communication, i.e. via the PE interconnect, will be comparatively more costly. Should the ratio of computation to data movement (with respect to DRAM) be suitable, it may be possible to avoid the on-die cache hierarchy entirely and use the DRAM as a flat memory hierarchy. This may offer an advantage where the data set is too large to hold in cache and the hardware pre-fetcher is ineffective with respect to cache lines.
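As a rough way of reasoning about the ratio of computation to data movement mentioned above, the sketch below applies a simple roofline-style estimate: attainable performance is the lesser of the peak compute rate and the product of memory bandwidth and arithmetic intensity. All figures (peak, bandwidths, kernel intensity) are assumptions chosen for illustration and do not describe any particular DSM product.

```python
# Roofline-style estimate (illustrative): does a kernel stay memory bound
# when local DRAM moves on package?  All figures below are assumptions.

def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Attainable performance = min(peak, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

peak = 500.0                 # assumed per-node peak, GFlop/s
bw_off_package = 80.0        # assumed conventional off-package DRAM bandwidth, GB/s
bw_die_stacked = 400.0       # assumed die-stacked DRAM bandwidth, GB/s
intensity = 0.5              # assumed stencil-like kernel: ~0.5 flops per byte moved

for name, bw in [("off-package DRAM", bw_off_package),
                 ("die-stacked DRAM", bw_die_stacked)]:
    perf = attainable_gflops(peak, bw, intensity)
    bound = "memory bound" if perf < peak else "compute bound"
    print(f"{name:17s}: {perf:6.1f} GFlop/s ({bound})")
```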

4.2 High Bandwidth Memory

High Bandwidth Memory (HBM) (JEDEC, 2015) is a memory interface specification and a natural consequence of DSM. When 4th-generation double data rate memory (DDR4) is integrated into the package or die (via DSM), the additional memory bandwidth can be exploited to reduce the limitations of the DRAM bus.


4.3 Hybrid Memory Cube

Hybrid Memory Cube (HMC) (Hybrid Memory Cube Consortium, 2012) is the competing interface to HBM, specified by the Hybrid Memory Cube Consortium. It follows the same principle of coupling fast, modern DRAM to the CPU via TSVs.

4.4 System on Chip processors

In-package heterogeneity gives rise to heterogeneous architectures, i.e. different types of processing element, such as GPU cores, CPU cores and FPGA fabric, being integrated into the same package, if not the same die. Current models for compute offload, where certain computational kernels are passed to a PE that has better performance for the kernel than the CPU core, must account for the time taken to transfer the data, and the kernel itself, to and from the PE. Even at the full speed of a 3rd-generation Peripheral Component Interconnect Express (PCIe) bus, this effect is non-negligible. Bringing these elements closer to the CPU cores allows more efficient offload, as the transfer time to and from the PE is reduced. The disadvantage is that the package must be larger in order to accommodate the extra elements. This leads to either a smaller area available for CPU cores or a higher density, which may increase power consumption and heat generation. Furthermore, as the manufacturing process may be more complex to accommodate the increased chip complexity, the likely yield from production will drop. This could increase costs or result in variability between chips, with some having different performance than others. The gamble is that the efficiency gained by using a PE better matched to the computational kernel outweighs the loss of the generalist CPU cores, and consequently the overall electrical efficiency is better than that of a chip of pure CPU cores.

An exascale machine built using this technology would have a high potential computational density (flops/mm²), but would be able to realise this only where both the algorithm and the programming model were able to make efficient use of the non-CPU cores.
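The offload trade-off described above can be expressed as a simple break-even estimate: offloading pays off only when the time saved on the faster PE exceeds the cost of moving data to and from it. The sketch below uses invented figures for host time, speedup, data volume and link speed; it illustrates the reasoning rather than modelling any real device or bus.

```python
# Break-even estimate for compute offload (illustrative; all figures assumed).

def offload_worthwhile(t_host_s, speedup, data_bytes, link_gbs, latency_s=10e-6):
    """Return True if offloading beats running the kernel on the host CPU."""
    t_device = t_host_s / speedup
    t_transfer = 2 * (latency_s + data_bytes / (link_gbs * 1e9))  # to and from the PE
    return t_device + t_transfer < t_host_s

t_host = 5e-3          # assumed kernel time on the host CPU: 5 ms
speedup = 8.0          # assumed speedup on the specialised PE
data = 64 * 1024**2    # assumed 64 MiB moved each way

# A PCIe-class link versus a much faster in-package link (assumed bandwidths).
for name, gbs in [("PCIe-class link", 12.0), ("in-package link", 200.0)]:
    print(f"{name}: offload worthwhile -> {offload_worthwhile(t_host, speedup, data, gbs)}")
```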

4.5 Fault tolerant algorithms

As the number of PEs increases, regardless of their type, the likelihood of a physical hardware fault occurring in a machine increases. An exascale machine will see regular hardware faults, some of which can be corrected by the hardware itself (e.g. error correcting code (ECC) DRAM); the rest, which cannot be corrected, will percolate up through the software stack and may become evident at the application layer. A small-scale fault, such as a non-correctable bit-flip in memory, may lead to a bad value being fed to a PE on load. A large fault may be one in which a PE or node fails and powers down. The part of the calculation being run on this element is lost and the work must be re-done. Some algorithms may require restarting from the most recent checkpoint; this may incur a significant time penalty if checkpoints are widely spaced in time, which is not unreasonable if the checkpoint itself incurs a penalty to create. However, it is possible for some algorithms to continue and gracefully recover, perhaps taking longer to complete; these algorithms are collectively called fault tolerant (FT) algorithms. Any machine built at the exascale level will require a degree of FT in the programming model, runtime or algorithm to cope with the unavoidable faults which will occur due to the scale of the hardware.
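One common way to reason about the checkpoint penalty discussed above, not taken from this deliverable but included for illustration, is Young's approximation for the optimal checkpoint interval, T_opt ≈ sqrt(2 · C · M), where C is the time to write one checkpoint and M is the mean time between failures (MTBF) of the whole machine. The sketch below applies it with assumed figures to show how quickly checkpoint overhead grows as the node count rises and the system MTBF falls.

```python
# Young's approximation for the optimal checkpoint interval (illustrative).
# T_opt ~ sqrt(2 * C * MTBF); all figures below are assumptions.
import math

def optimal_interval_s(checkpoint_cost_s, system_mtbf_s):
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

checkpoint_cost = 600.0     # assumed: 10 minutes to write one checkpoint
node_mtbf_years = 10.0      # assumed MTBF of a single node

for nodes in (5_000, 100_000, 1_000_000):
    system_mtbf = node_mtbf_years * 365 * 24 * 3600 / nodes   # naive whole-machine MTBF
    interval = optimal_interval_s(checkpoint_cost, system_mtbf)
    overhead = checkpoint_cost / interval                     # rough fraction of time spent checkpointing
    # Note: the approximation assumes C << MTBF; an overhead near 100% signals
    # that plain checkpoint/restart is no longer viable at that scale.
    print(f"{nodes:>9,} nodes: MTBF {system_mtbf / 3600:6.1f} h, "
          f"checkpoint every {interval / 60:6.1f} min, overhead ~{overhead:.0%}")
```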

4.6 Scalable programming models

It is highly likely that the programming models used on current petascale machines will still run on exascale machines; the difficulty will come from ensuring they make efficient use of the available hardware. Shared memory models such as OpenMP and Pthreads will have to be able to handle on-die heterogeneity. OpenMP and OpenACC currently have a model for this type of working which may extend to meet the hardware. There is some element of chicken-and-egg here, where the models' extensions will adapt to meet the hardware as it becomes available. Distributed memory approaches, most notably the Message Passing Interface (MPI), will have to be able to scale to O(10^6) nodes, assuming that any intra-node threading is handled by another model, or possibly O(10^7) processes if a code is written using MPI alone. Partitioned Global Address Space (PGAS) implementations such as Chapel, Unified Parallel C (UPC), GASPI and Co-array Fortran (CAF) will face the same limitations. The advantage they may have over the purely distributed memory models is that they already include a notion of memory affinity, which may be adapted or extended to be aware of the difference between on-die and off-die/off-node memory.

5 What would a system look like if it used all such technologies?

Based on the technologies above, an exascale machine could look quite different to current architectures. The bottlenecks which dominate the system will be the energy required and the latency incurred in moving data between nodes. Nodes will be a collection of heterogeneous processors: some general purpose, like today's CPUs, perhaps of varying sizes; some more specialised (such as SIMD units), like today's GPGPUs; and some configurable cores akin to FPGA fabrics. Closely coupled to this will be very high bandwidth, low latency DRAM which acts both as a memory for data storage via the normal cache mechanism and as a buffer for I/O. Configuration of this may be dynamic and flexible, allowing a runtime to best manage the resource.

The programming models will be more complex. In order to manage such a diverse set of processors available on each node, smart runtime systems will offload the computational kernels to the most suitable available processor. They may manage scheduling of such kernels either automatically, or with some level of direction from the user. The choice will largely depend on which metrics the user ranks most highly: energy efficiency, time to solution, peak power, or best utilisation of nodes.
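As a purely hypothetical illustration of the kind of metric-driven scheduling decision described above, the sketch below picks a processing element for a kernel according to whichever metric the user ranks most highly. The PE names and per-kernel time and energy estimates are invented; no existing runtime API is implied.

```python
# Hypothetical sketch of a metric-driven offload decision; the PE names and
# per-kernel estimates below are invented for illustration.

pe_estimates = {
    "cpu":  {"time_s": 4.0, "energy_j": 400.0},
    "simd": {"time_s": 0.8, "energy_j": 180.0},
    "fpga": {"time_s": 1.5, "energy_j":  90.0},
}

def choose_pe(estimates, metric):
    """Pick the PE that minimises the user's preferred metric."""
    return min(estimates, key=lambda pe: estimates[pe][metric])

print("best for time to solution:  ", choose_pe(pe_estimates, "time_s"))    # -> simd
print("best for energy to solution:", choose_pe(pe_estimates, "energy_j"))  # -> fpga
```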


6 Conclusion and Future Work

This deliverable has highlighted some of the currently emerging technologies which will go some way towards delivering a feasible, usable and efficient exascale machine in the next 10 to 15 years. The exact construction, with respect to hardware, software, programming model and algorithms, is likely to be an evolution of that used in current systems, with different bottlenecks and thus different targets for optimisation at each level.

7 Bibliography

ARCHER. (2016, May 2). ARCHER hardware guide. Retrieved from ARCHER website: http://www.archer.ac.uk/about-archer/hardware/

Hybrid Memory Cube Consortium. (2012). Hybrid Memory Cube Specification 1.0.

JEDEC. (2015, November 1). High Bandwidth Memory (HBM) DRAM.