Addressing Emerging Challenges in Designing HPC Runtimes


Transcript of Addressing Emerging Challenges in Designing HPC Runtimes

Addressing Emerging Challenges in Designing HPC Runtimes: Energy-Awareness, Accelerators and Virtualization

Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at HPCAC-Switzerland (Mar '16)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Scalability for million to billion processors
•  Collective communication
•  Unified Runtime for Hybrid MPI+PGAS programming (MPI+OpenSHMEM, MPI+UPC, CAF, …)
•  InfiniBand Network Analysis and Monitoring (INAM)
•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Integrated Support for GPGPUs
   –  CUDA-Aware MPI
   –  GPUDirect RDMA (GDR) Support
   –  CUDA-aware Non-blocking Collectives
   –  Support for Managed Memory
   –  Efficient Datatype Processing
   –  Supporting Streaming Applications with GDR
   –  Efficient Deep Learning with MVAPICH2-GDR
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications

MPI + CUDA - Naive

•  Data movement in applications with standard MPI and CUDA interfaces
•  High Productivity and Low Performance

At Sender:
   cudaMemcpy(s_hostbuf, s_devbuf, . . .);
   MPI_Send(s_hostbuf, size, . . .);

At Receiver:
   MPI_Recv(r_hostbuf, size, . . .);
   cudaMemcpy(r_devbuf, r_hostbuf, . . .);

(Figure: data path GPU - PCIe - CPU - NIC - Switch)

MPI + CUDA - Advanced

•  Pipelining at user level with non-blocking MPI and CUDA interfaces
•  Low Productivity and High Performance

At Sender:
   for (j = 0; j < pipeline_len; j++)
       cudaMemcpyAsync(s_hostbuf + j * blksz, s_devbuf + j * blksz, …);
   for (j = 0; j < pipeline_len; j++) {
       while (result != cudaSuccess) {      /* wait for chunk j to reach the host */
           result = cudaStreamQuery(…);
           if (j > 0) MPI_Test(…);          /* progress earlier sends meanwhile */
       }
       MPI_Isend(s_hostbuf + j * blksz, blksz, . . .);
   }
   MPI_Waitall(…);

<<Similar at receiver>>

(Figure: data path GPU - PCIe - CPU - NIC - Switch)

GPU-Aware MPI Library: MVAPICH2-GPU

•  Standard MPI interfaces used for unified data movement
•  Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
•  Overlaps data movement from GPU with RDMA transfers
•  High Performance and High Productivity

At Sender:
   MPI_Send(s_devbuf, size, …);   /* device buffer passed directly; staging and pipelining happen inside MVAPICH2 */

At Receiver:
   MPI_Recv(r_devbuf, size, …);

GPU-Direct RDMA (GDR) with CUDA

•  OFED with support for GPUDirect RDMA is developed by NVIDIA and Mellanox
•  OSU has a design of MVAPICH2 using GPUDirect RDMA
   –  Hybrid design using GPU-Direct RDMA
      •  GPUDirect RDMA and host-based pipelining
      •  Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
   –  Support for communication using multi-rail
   –  Support for Mellanox Connect-IB and ConnectX VPI adapters
   –  Support for RoCE with Mellanox ConnectX VPI adapters

(Figure: data paths between IB adapter, chipset, CPU, system memory, GPU and GPU memory on SNB E5-2670 / IVB E5-2680 V2 nodes)

P2P bandwidth limits:
   SNB E5-2670:      P2P write: 5.2 GB/s,  P2P read: < 1.0 GB/s
   IVB E5-2680 V2:   P2P write: 6.4 GB/s,  P2P read: 3.5 GB/s

CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases

•  Support for MPI communication from NVIDIA GPU device memory
•  High performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host and Host-GPU)
•  High performance intra-node point-to-point communication for multi-GPU adapters/node (GPU-GPU, GPU-Host and Host-GPU)
•  Taking advantage of CUDA IPC (available since CUDA 4.1) in intra-node communication for multiple GPU adapters/node
•  Optimized and tuned collectives for GPU device buffers
•  MPI datatype support for point-to-point and collective communication from GPU device buffers

Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)

•  MVAPICH2-GDR-2.2b, Intel Ivy Bridge (E5-2680 v2) node - 20 cores
•  NVIDIA Tesla K40c GPU, Mellanox Connect-IB Dual-FDR HCA
•  CUDA 7, Mellanox OFED 2.4 with GPU-Direct-RDMA
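Numbers like the ones charted below are typically collected with the GPU-enabled OSU Micro-benchmarks (OMB). A minimal sketch of such a run, assuming an OMB build with CUDA support and placeholder host names; the "D D" arguments place both the send and receive buffers in GPU device memory:

   mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 ./osu_latency D D
   mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 ./osu_bw D D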

(Charts: GPU-GPU inter-node latency (as low as 2.18 us), bandwidth, and bi-bandwidth vs. message size, comparing MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR; annotated improvements of roughly 10x, 11x, and 2x over MV2 without GDR)

Application-Level Evaluation (HOOMD-blue)

•  Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
•  HoomdBlue Version 1.0.5
•  GDRCOPY enabled:
   MV2_USE_CUDA=1  MV2_IBA_HCA=mlx5_0  MV2_IBA_EAGER_THRESHOLD=32768
   MV2_VBUF_TOTAL_SIZE=32768  MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768
   MV2_USE_GPUDIRECT_GDRCOPY=1  MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
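For reference, a minimal sketch of how settings like these are passed at launch time with the MVAPICH2 mpirun_rsh launcher; the process count, host file, and application command are placeholders:

   mpirun_rsh -np 32 -hostfile hosts \
       MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 \
       MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 \
       MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 \
       MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384 \
       ./hoomd_benchmark

mpirun_rsh accepts NAME=VALUE settings between the host file and the executable, so the same parameters can also be kept in a job script.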

(Charts: average time steps per second (TPS) vs. number of processes (4-32) for 64K and 256K particles, comparing MV2 with MV2+GDR; about 2x improvement with GDR)

CUDA-Aware Non-Blocking Collectives

•  Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
•  Available since MVAPICH2-GDR 2.2a

(Charts: medium/large message overlap (%) vs. message size (4K-1M) on 64 GPU nodes for Ialltoall and Igather, with 1 process/node and 2 processes/node (1 process/GPU))

A. Venkatesh, K. Hamidouche, H. Subramoni, and D. K. Panda, Offloaded GPU Collectives using CORE-Direct and CUDA Capabilities on IB Clusters, HiPC, 2015
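For context, a minimal sketch (not from the slides) of how such a CUDA-aware non-blocking collective is posted directly on GPU buffers and overlapped with independent work; the element count and the overlapped computation are placeholders:

   #include <mpi.h>
   #include <cuda_runtime.h>

   int main(int argc, char **argv)
   {
       MPI_Init(&argc, &argv);
       int nprocs;
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       const int count = 1 << 18;                  /* elements per peer (placeholder) */
       float *d_send, *d_recv;                     /* GPU device buffers */
       cudaMalloc((void **)&d_send, (size_t)nprocs * count * sizeof(float));
       cudaMalloc((void **)&d_recv, (size_t)nprocs * count * sizeof(float));

       MPI_Request req;
       /* CUDA-aware MPI: device pointers passed directly, no staging copies in the application */
       MPI_Ialltoall(d_send, count, MPI_FLOAT,
                     d_recv, count, MPI_FLOAT, MPI_COMM_WORLD, &req);

       /* ... independent computation here overlaps with the collective ... */

       MPI_Wait(&req, MPI_STATUS_IGNORE);

       cudaFree(d_send);
       cudaFree(d_recv);
       MPI_Finalize();
       return 0;
   }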

Communication Runtime with GPU Managed Memory

•  With CUDA 6.0, NVIDIA introduced CUDA Managed (or Unified) Memory, allowing a common memory allocation for GPU or CPU through the cudaMallocManaged() call
•  Significant productivity benefits due to abstraction of explicit allocation and cudaMemcpy()
•  Extended MVAPICH2 to perform communications directly from managed buffers (available in MVAPICH2-GDR 2.2b)
•  OSU Micro-benchmarks extended to evaluate the performance of point-to-point and collective communications using managed buffers; available in OMB 5.2

D. S. Banerjee, K. Hamidouche, D. K. Panda, Designing High Performance Communication Runtime for GPU Managed Memory: Early Experiences, GPGPU-9 Workshop held in conjunction with PPoPP 2016, Barcelona, Spain
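A minimal sketch (not from the slides) of the usage model this enables: a single cudaMallocManaged() buffer is handed directly to MPI calls with no explicit cudaMemcpy(); the message size and the surrounding kernels are placeholders:

   #include <mpi.h>
   #include <cuda_runtime.h>

   int main(int argc, char **argv)
   {
       MPI_Init(&argc, &argv);
       int rank;
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       const size_t n = 1 << 20;                     /* element count (placeholder) */
       float *buf;
       cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);  /* visible to both CPU and GPU */

       if (rank == 0) {
           /* ... produce buf on the host or in a GPU kernel ... */
           MPI_Send(buf, (int)n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
       } else if (rank == 1) {
           MPI_Recv(buf, (int)n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
           /* ... consume buf directly in a GPU kernel, no explicit cudaMemcpy() ... */
       }

       cudaFree(buf);
       MPI_Finalize();
       return 0;
   }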

(Charts: latency (us) and bandwidth (MB/s) vs. message size (1 byte - 16 KB), comparing host-to-host (H-H) with managed (MH-MH) latency and device-to-device (D-D) with managed (MD-MD) bandwidth)

MPI Datatype Processing (Communication Optimization)

(Figure: CPU/GPU timelines contrasting the existing design, where for each MPI_Isend the progress engine initiates the packing kernel, waits for the kernel (WFK), and only then starts the send, with the proposed design, which initiates the kernels for Isend(1), Isend(2) and Isend(3) back-to-back and overlaps the kernel waits with the sends; the expected benefit is that the proposed design finishes earlier than the existing one)

Common Scenario (waste of computing resources on CPU and GPU):

   MPI_Isend(A, ..Datatype, …);
   MPI_Isend(B, ..Datatype, …);
   MPI_Isend(C, ..Datatype, …);
   MPI_Isend(D, ..Datatype, …);
   …
   MPI_Waitall(…);

   * Buf1, Buf2… contain a non-contiguous MPI Datatype
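For concreteness, a minimal sketch (not from the slides, dimensions hypothetical) of how such a non-contiguous datatype might be described and posted directly from a GPU buffer with a CUDA-aware MPI library:

   #include <mpi.h>

   #define NX 1024   /* placeholder field dimensions */
   #define NY 1024

   /* Post a send of the east column of an NX x NY row-major field that
      lives in GPU memory (d_field is a device pointer). */
   void post_east_halo_send(float *d_field, int peer, MPI_Request *req)
   {
       MPI_Datatype column;
       MPI_Type_vector(NY, 1, NX, MPI_FLOAT, &column);   /* NY blocks of 1 element, stride NX */
       MPI_Type_commit(&column);

       /* The CUDA-aware library packs/unpacks the strided datatype internally;
          the proposed design overlaps those packing kernels with other sends. */
       MPI_Isend(d_field + (NX - 1), 1, column, peer, 0, MPI_COMM_WORLD, req);

       MPI_Type_free(&column);   /* deallocation is deferred until the pending send completes */
   }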

Application-Level Evaluation (Halo Exchange - Cosmo)

(Charts: normalized execution time vs. number of GPUs for the Default, Callback-based, and Event-based designs on the CSCS GPU cluster (16-96 GPUs) and the Wilkes GPU cluster (4-32 GPUs))

•  2x improvement on 32 GPU nodes
•  30% improvement on 96 GPU nodes (8 GPUs/node)

C. Chu, K. Hamidouche, A. Venkatesh, D. Banerjee, H. Subramoni, and D. K. Panda, Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems, IPDPS'16

Nature of Streaming Applications

•  Pipelined data-parallel compute phases that form the crux of streaming applications lend themselves to GPGPUs
•  Data distribution to GPGPU sites occurs over PCIe within the node and over InfiniBand interconnects across nodes
•  The broadcast operation is a key determinant of the throughput of streaming applications
•  The current broadcast operation on GPU clusters does not take advantage of
   •  IB hardware MCAST
   •  GPUDirect RDMA

Courtesy: Agarwalla, Bikash, et al. "Streamline: A scheduling heuristic for streaming applications on the grid." Electronic Imaging 2006

SGL-based Design for Efficient Broadcast Operation on GPU Systems

•  Current design is limited by the expensive copies from/to GPUs
•  Proposed several alternative designs to avoid the overhead of the copy
   •  Loopback, GDRCOPY and hybrid
   •  High performance and scalability
   •  Still uses PCI resources for Host-GPU copies
•  Proposed SGL-based design
   •  Combines IB MCAST and GPUDirect RDMA features
   •  High performance and scalability for D-D broadcast
   •  Direct code path between HCA and GPU
   •  Frees PCI resources
•  3x improvement in latency

A. Venkatesh, H. Subramoni, K. Hamidouche, and D. K. Panda, A High Performance Broadcast Design with Hardware Multicast and GPUDirect RDMA for Streaming Applications on InfiniBand Clusters, IEEE Int'l Conf. on High Performance Computing (HiPC'14)

Accelerating Deep Learning with MVAPICH2-GDR

•  Caffe: a flexible and layered Deep Learning framework
•  Benefits and weaknesses
   –  Multi-GPU training within a single node
   –  Performance degradation for GPUs across different sockets
•  Can we enhance Caffe with MVAPICH2-GDR?
   –  Caffe-Enhanced: a CUDA-Aware MPI version
   –  Enables scale-up (within a node) and scale-out (across multi-GPU nodes)
   –  Initial evaluation suggests up to 8x reduction in training time on the CIFAR-10 dataset

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications

MPI Applications on MIC Clusters

•  Flexibility in launching MPI jobs on clusters with Xeon Phi

(Figure: spectrum from multi-core-centric to many-core-centric execution: Host-only (MPI program on the Xeon only), Offload / reverse offload (MPI program on one side with offloaded computation on the other), Symmetric (MPI programs on both Xeon and Xeon Phi), and Coprocessor-only (MPI program on the Xeon Phi only))

MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC

•  Offload Mode
•  Intranode Communication
   •  Coprocessor-only and Symmetric Mode
•  Internode Communication
   •  Coprocessors-only and Symmetric Mode
•  Multi-MIC Node Configurations
•  Running on three major systems
   •  Stampede, Blueridge (Virginia Tech) and Beacon (UTK)

MIC-Remote-MIC P2P Communication with Proxy-based Communication

(Charts: intra-socket and inter-socket P2P latency (large messages, 8K-2M bytes) and bandwidth (1 byte - 1M bytes); peak bandwidths of 5236 MB/sec and 5594 MB/sec)

Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)

A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda, High Performance Alltoall and Allgather Designs for InfiniBand MIC Clusters, IPDPS'14, May 2014

(Charts: 32-node Allgather small-message latency (16H+16M) and large-message latency (8H+8M), and 32-node Alltoall large-message latency (8H+8M), comparing MV2-MIC with MV2-MIC-Opt, with annotated improvements of 76%, 58%, and 55%; P3DFFT execution time (communication vs. computation) on 32 nodes (8H+8M), size 2K x 2K x 1K)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications

Can HPC and Virtualization be Combined?

•  Virtualization has many benefits
   –  Fault-tolerance
   –  Job migration
   –  Compaction
•  Has not been very popular in HPC due to the overhead associated with virtualization
•  New SR-IOV (Single Root I/O Virtualization) support available with Mellanox InfiniBand adapters changes the field
•  Enhanced MVAPICH2 support for SR-IOV
•  MVAPICH2-Virt 2.1 (with and without OpenStack) is publicly available
•  How about Containers support?

J. Zhang, X. Lu, J. Jose, R. Shi and D. K. Panda, Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? EuroPar'14

J. Zhang, X. Lu, J. Jose, M. Li, R. Shi and D. K. Panda, High Performance MPI Library over SR-IOV enabled InfiniBand Clusters, HiPC'14

J. Zhang, X. Lu, M. Arnold and D. K. Panda, MVAPICH2 over OpenStack with SR-IOV: an Efficient Approach to Build HPC Clouds, CCGrid'15

Overview of MVAPICH2-Virt with SR-IOV and IVSHMEM

•  Redesign MVAPICH2 to make it virtual machine aware
   –  SR-IOV shows near-to-native performance for inter-node point-to-point communication
   –  IVSHMEM offers zero-copy access to data on shared memory of co-resident VMs
   –  Locality Detector: maintains the locality information of co-resident virtual machines
   –  Communication Coordinator: selects the communication channel (SR-IOV, IVSHMEM) adaptively

(Figure: host environment with two guest VMs; each guest runs an MPI process whose VF driver is bound to an SR-IOV Virtual Function of the InfiniBand adapter, while an IV-Shmem channel through /dev/shm provides the intra-host path alongside the SR-IOV channel; the hypervisor holds the PF driver for the Physical Function)

J. Zhang, X. Lu, J. Jose, R. Shi, D. K. Panda. Can Inter-VM Shmem Benefit MPI Applications on SR-IOV based Virtualized InfiniBand Clusters? Euro-Par, 2014.

J. Zhang, X. Lu, J. Jose, R. Shi, M. Li, D. K. Panda. High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters. HiPC, 2014.

MVAPICH2-Virt with SR-IOV and IVSHMEM over OpenStack

•  OpenStack is one of the most popular open-source solutions to build clouds and manage virtual machines
•  Deployment with OpenStack
   –  Supporting SR-IOV configuration
   –  Supporting IVSHMEM configuration
   –  Virtual Machine aware design of MVAPICH2 with SR-IOV
•  An efficient approach to build HPC Clouds with MVAPICH2-Virt and OpenStack

(Figure: OpenStack services around a VM: Nova provisions, Glance provides/stores images, Neutron provides the network, Swift backs up volumes and stores images, Keystone provides authentication, Cinder provides volumes, Heat orchestrates the cloud, Ceilometer monitors, Horizon provides the UI)

J. Zhang, X. Lu, M. Arnold, D. K. Panda. MVAPICH2 over OpenStack with SR-IOV: An Efficient Approach to Build HPC Clouds. CCGrid, 2015.

Application-Level Performance on Chameleon

•  32 VMs, 6 cores/VM
•  Compared to native, 2-5% overhead for Graph500 with 128 processes
•  Compared to native, 1-9.5% overhead for SPEC MPI 2007 with 128 processes

(Charts: SPEC MPI 2007 execution time for milc, leslie3d, pop2, GAPgeofem, zeusmp2, and lu, and Graph500 execution time across problem sizes (scale, edge factor), comparing MV2-SR-IOV-Def, MV2-SR-IOV-Opt, and MV2-Native)

NSF Chameleon Cloud: A Powerful and Flexible Experimental Instrument

•  Large-scale instrument
   –  Targeting Big Data, Big Compute, Big Instrument research
   –  ~650 nodes (~14,500 cores), 5 PB disk over two sites, 2 sites connected with a 100G network
•  Reconfigurable instrument
   –  Bare metal reconfiguration, operated as a single instrument, graduated approach for ease-of-use
•  Connected instrument
   –  Workload and Trace Archive
   –  Partnerships with production clouds: CERN, OSDC, Rackspace, Google, and others
   –  Partnerships with users
•  Complementary instrument
   –  Complementing GENI, Grid'5000, and other testbeds
•  Sustainable instrument
   –  Industry connections

http://www.chameleoncloud.org/

Containers Support: MVAPICH2 Intra-node Point-to-Point Performance on Chameleon

•  Intra-node inter-container
•  Compared to Container-Def, up to 81% and 191% improvement on latency and bandwidth
•  Compared to native, minor overhead on latency and bandwidth

(Charts: intra-node inter-container latency (us) and bandwidth (MBps) vs. message size (1 byte - 64 KB) for Container-Def, Container-Opt, and Native)

Containers Support: Application-Level Performance on Chameleon

•  64 containers across 16 nodes, pinning 4 cores per container
•  Compared to Container-Def, up to 11% and 16% execution time reduction for NAS and Graph500
•  Compared to native, less than 9% and 4% overhead for NAS and Graph500
•  Optimized container support will be available with the next release of MVAPICH2-Virt

(Charts: Graph500 execution time across problem sizes (scale, edge factor) and NAS (MG.D, FT.D, EP.D, LU.D, CG.D) execution time for Container-Def, Container-Opt, and Native)

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications

Designing Energy-Aware (EA) MPI Runtime

(Figure: overall application energy expenditure broken into energy spent in computation routines and energy spent in communication routines (point-to-point, collective, and RMA routines); MVAPICH2-EA designs cover MPI two-sided and collectives (e.g., MVAPICH2) and MPI-3 RMA implementations (e.g., MVAPICH2), with impact on one-sided runtimes (e.g., ComEx) and other PGAS implementations (e.g., OSHMPI))

Energy-Aware MVAPICH2 & OSU Energy Management Tool (OEMT)

•  MVAPICH2-EA 2.1 (Energy-Aware)
   •  A white-box approach
   •  New energy-efficient communication protocols for pt-pt and collective operations
   •  Intelligently apply the appropriate energy saving techniques
   •  Application-oblivious energy saving
•  OEMT
   •  A library utility to measure energy consumption for MPI applications
   •  Works with all MPI runtimes
   •  PRELOAD option for precompiled applications
   •  Does not require ROOT permission:
      •  A safe kernel module to read only a subset of MSRs

MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)

•  An energy-efficient runtime that provides energy savings without application knowledge
•  Uses the best energy lever automatically and transparently
•  Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
•  Pessimistic MPI applies the energy reduction lever to each MPI call

A Case for Application-Oblivious Energy-Efficient MPI Runtime. A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoise, Supercomputing '15, Nov 2015 [Best Student Paper Finalist]

MPI-3 RMA Energy Savings with Proxy-Applications

(Charts: Graph500 execution time (seconds) and energy usage (Joules) at 128, 256, and 512 processes for the optimistic, pessimistic, and EAM-RMA runtimes; up to 46% energy savings)

•  MPI_Win_fence dominates application execution time in Graph500
•  Between 128 and 512 processes, EAM-RMA yields between 31% and 46% savings with no degradation in execution time in comparison with the default optimistic MPI runtime

MPI-3 RMA Energy Savings with Proxy-Applications

(Charts: SCF execution time (seconds) and energy usage (Joules) at 128, 256, and 512 processes for the optimistic, pessimistic, and EAM-RMA runtimes; up to 42% energy savings)

•  The SCF (self-consistent field) calculation spends nearly 75% of total time in the MPI_Win_unlock call
•  With 256 and 512 processes, EAM-RMA yields 42% and 36% savings at 11% degradation (close to the permitted degradation ρ = 10%)
•  128 processes is an exception due to 2-sided and 1-sided interaction
•  MPI-3 RMA energy-efficient support will be available in an upcoming MVAPICH2-EA release

Overview of A Few Challenges being Addressed by the MVAPICH2 Project for Exascale

•  Integrated Support for GPGPUs
•  Integrated Support for MICs
•  Virtualization (SR-IOV and Containers)
•  Energy-Awareness
•  Best Practice: Set of Tunings for Common Applications

Applications-Level Tuning: Compilation of Best Practices

•  The MPI runtime has many parameters
•  Tuning a set of parameters can help you extract higher performance
•  Compiled a list of such contributions through the MVAPICH website
   –  http://mvapich.cse.ohio-state.edu/best_practices/
•  Initial list of applications
   –  Amber
   –  HoomdBlue
   –  HPCG
   –  Lulesh
   –  MILC
   –  MiniAMR
   –  Neuron
   –  SMG2000
•  Soliciting additional contributions; send your results to mvapich-help at cse.ohio-state.edu and we will link these results with credits to you

MVAPICH2 – Plans for Exascale

•  Performance and memory scalability toward 1M cores
•  Hybrid programming (MPI+OpenSHMEM, MPI+UPC, MPI+CAF, …)
•  Support for task-based parallelism (UPC++)*
•  Enhanced optimization for GPU support and accelerators
•  Taking advantage of advanced features of Mellanox InfiniBand
   •  On-Demand Paging (ODP)
   •  Switch-IB2 SHArP
   •  GID-based support
•  Enhanced inter-node and intra-node communication schemes for upcoming architectures
   •  OpenPower*
   •  OmniPath-PSM2*
   •  Knights Landing
•  Extended topology-aware collectives
•  Extended energy-aware designs and virtualization support
•  Extended support for the MPI Tools Interface (as in MPI 3.0)
•  Extended checkpoint-restart and migration support with SCR
•  Support for * features will be available in MVAPICH2-2.2 RC1

Looking into the Future…

•  Exascale systems will be constrained by
   –  Power
   –  Memory per core
   –  Data movement cost
   –  Faults
•  Programming models and runtimes for HPC need to be designed for
   –  Scalability
   –  Performance
   –  Fault-resilience
   –  Energy-awareness
   –  Programmability
   –  Productivity
•  Highlighted some of the issues and challenges
•  Need continuous innovation on all these fronts

Funding Acknowledgments

Funding Support by (sponsor logos)

Equipment Support by (vendor logos)

Personnel Acknowledgments

Current Students
   –  A. Augustine (M.S.)
   –  A. Awan (Ph.D.)
   –  S. Chakraborthy (Ph.D.)
   –  C.-H. Chu (Ph.D.)
   –  N. Islam (Ph.D.)
   –  M. Li (Ph.D.)
   –  K. Kulkarni (M.S.)
   –  M. Rahman (Ph.D.)
   –  D. Shankar (Ph.D.)
   –  A. Venkatesh (Ph.D.)
   –  J. Zhang (Ph.D.)

Past Students
   –  P. Balaji (Ph.D.)
   –  S. Bhagvat (M.S.)
   –  A. Bhat (M.S.)
   –  D. Buntinas (Ph.D.)
   –  L. Chai (Ph.D.)
   –  B. Chandrasekharan (M.S.)
   –  N. Dandapanthula (M.S.)
   –  V. Dhanraj (M.S.)
   –  T. Gangadharappa (M.S.)
   –  K. Gopalakrishnan (M.S.)
   –  W. Huang (Ph.D.)
   –  W. Jiang (M.S.)
   –  J. Jose (Ph.D.)
   –  S. Kini (M.S.)
   –  M. Koop (Ph.D.)
   –  R. Kumar (M.S.)
   –  S. Krishnamoorthy (M.S.)
   –  K. Kandalla (Ph.D.)
   –  P. Lai (M.S.)
   –  J. Liu (Ph.D.)
   –  M. Luo (Ph.D.)
   –  A. Mamidala (Ph.D.)
   –  G. Marsh (M.S.)
   –  V. Meshram (M.S.)
   –  A. Moody (M.S.)
   –  S. Naravula (Ph.D.)
   –  R. Noronha (Ph.D.)
   –  X. Ouyang (Ph.D.)
   –  S. Pai (M.S.)
   –  S. Potluri (Ph.D.)
   –  R. Rajachandrasekar (Ph.D.)
   –  G. Santhanaraman (Ph.D.)
   –  A. Singh (Ph.D.)
   –  J. Sridhar (M.S.)
   –  S. Sur (Ph.D.)
   –  H. Subramoni (Ph.D.)
   –  K. Vaidyanathan (Ph.D.)
   –  A. Vishnu (Ph.D.)
   –  J. Wu (Ph.D.)
   –  W. Yu (Ph.D.)

Current Research Scientists
   –  K. Hamidouche
   –  X. Lu

Current Senior Research Associate
   –  H. Subramoni

Past Research Scientist
   –  S. Sur

Current Post-Docs
   –  J. Lin
   –  D. Banerjee

Past Post-Docs
   –  H. Wang
   –  X. Besseron
   –  H.-W. Jin
   –  M. Luo
   –  E. Mancini
   –  S. Marcarelli
   –  J. Vienne

Current Programmer
   –  J. Perkins

Past Programmers
   –  D. Bureddy

Current Research Specialist
   –  M. Arnold

International Workshop on Communication Architectures at Extreme Scale (ExaComm)

ExaComm 2015 was held with the Int'l Supercomputing Conference (ISC '15), at Frankfurt, Germany, on Thursday, July 16th, 2015

One Keynote Talk: John M. Shalf, CTO, LBL/NERSC

Four Invited Talks: Dror Goldenberg (Mellanox); Martin Schulz (LLNL); Cyriel Minkenberg (IBM-Zurich); Arthur (Barney) Maccabe (ORNL)

Panel: Ron Brightwell (Sandia)

Two Research Papers

ExaComm 2016 will be held in conjunction with ISC'16
http://web.cse.ohio-state.edu/~subramon/ExaComm16/exacomm16.html

Technical Paper Submission Deadline: Friday, April 15, 2016

[email protected]

Thank You!

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project
http://mvapich.cse.ohio-state.edu/