
Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Jie Zhang
Dr. Dhabaleswar K. Panda (Advisor)
Department of Computer Science & Engineering, The Ohio State University
SC 2017 Doctoral Showcase

Outline
• Introduction
• Problem Statement
• Detailed Designs and Results
• Impact on HPC Community
• Conclusion

Cloud Computing and Virtualization
• Cloud computing focuses on maximizing the effectiveness of shared resources
• Virtualization is the key technology behind it
• Widely adopted in industry computing environments
• IDC forecasts worldwide public IT cloud services spending will reach $195 billion by 2020 (Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)

Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-/many-core technologies
• Accelerators (GPUs/co-processors)
• Large memory nodes (up to 2 TB)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
[Figure: high-performance interconnects such as InfiniBand (with SR-IOV) offer <1 usec latency and 200 Gbps bandwidth; example systems include the SDSC Comet and TACC Stampede clouds]

Single Root I/O Virtualization (SR-IOV)
• SR-IOV provides new opportunities to design HPC clouds with very low overhead by bypassing the hypervisor
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI passthrough (see the verbs sketch below)
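Because a passed-through VF appears to the guest as an ordinary PCIe HCA, it can be probed with the standard libibverbs API. The following is a minimal sketch (not from the slides) that lists the RDMA devices visible inside a VM and opens the first one; the device names reported (e.g., mlx5_0) depend on the adapter.

```c
/* Minimal sketch: enumerate and open the SR-IOV VF inside a guest.
 * Compile with: gcc vf_probe.c -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "No RDMA devices found (is the VF attached?)\n");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

    /* Open the first device; a VF behaves like a regular HCA here. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        ibv_free_device_list(devs);
        return EXIT_FAILURE;
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return EXIT_SUCCESS;
}
```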
4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications
The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. Application execution time can be decreased by up to 96% compared to SR-IOV. The results further show that IVShmem introduces only small overheads compared with the native environment.
The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and a comparison with native mode. We discuss related work in Section 5 and conclude in Section 6.
2 Background
Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data in shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system call layer, and its interfaces are visible to user-space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in the guest VMs; a minimal host-side sketch of setting up such a region is shown below. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
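The host-side region that ivshmem exports is just a POSIX shared memory object. The sketch below is an illustration, not code from the paper: the name /ivshmem_region and the 16 MB size are arbitrary placeholders, and QEMU would be pointed at the same object when the co-resident VMs are started.

```c
/* Sketch: create the host POSIX shared memory object that backs ivshmem.
 * Compile with: gcc shm_host.c -lrt
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/ivshmem_region"   /* placeholder name */
#define SHM_SIZE (16UL << 20)        /* 16 MB, placeholder size */

int main(void)
{
    /* Create (or open) the object under /dev/shm on the host. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return EXIT_FAILURE; }

    if (ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return EXIT_FAILURE; }

    /* Map it; QEMU maps the same object and exposes it to guests as a PCI BAR. */
    void *base = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    strcpy((char *)base, "hello from the host");   /* visible to co-resident VMs */

    munmap(base, SHM_SIZE);
    close(fd);
    /* shm_unlink(SHM_NAME) would remove the object when it is no longer needed. */
    return EXIT_SUCCESS;
}
```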
Fig. 2. Overview of Inter-VM Shmem and SR-IOV communication mechanisms: (a) Inter-VM Shmem mechanism [15]; (b) SR-IOV mechanism [22]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI passthrough, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to I/O virtualization.

Does SR-IOV alone suffice to build efficient HPC clouds? No.
• It does not support locality-aware communication; co-located VMs still have to communicate through the SR-IOV channel
• It does not support VM migration, because of device passthrough
• It does not properly manage and isolate critical virtualized resources

Problem Statements
• Can the MPI runtime be redesigned to provide virtualization support for VMs/containers when building HPC clouds?
• How much benefit can be achieved on HPC clouds with the redesigned MPI runtime for scientific kernels and applications?
• Can fault tolerance/resilience (live migration) be supported on SR-IOV-enabled HPC clouds?
• Can we co-design with resource management and scheduling systems to enable HPC clouds on modern HPC systems?

Research Framework

MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Jul '17 ranking)
    • 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 15th-ranked 241,108-core cluster (Pleiades) at NASA
    • 20th-ranked 522,080-core cluster (Stampede) at TACC
    • 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

Locality-aware MPI Communication with SR-IOV and IVShmem
[Figure: MPI library architecture in native and virtualized environments. Native stack: Application, MPI layer, ADI3 layer, SMP channel (shared memory) and network channel (InfiniBand API) over native hardware. Virtualized stack: a VM-aware MPI library adds a locality detector and a communication coordinator that steer traffic over the SMP, IVShmem, and SR-IOV channels on virtualized hardware]
• The MPI library runs in both native and virtualized environments
• In the virtualized environment, the design:
  – supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
  – performs locality detection
  – performs communication coordination (a simplified channel-selection sketch follows the citation below)
  – applies communication optimizations on the different channels (SMP, IVShmem, SR-IOV; RC, UD)
J. Zhang, X. Lu, J. Jose, and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC '14), Dec 2014
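To make the coordination concrete, here is a highly simplified sketch of how a VM-aware runtime might pick a channel per peer. It is illustrative only: the helper names (channel_t, peer_info_t) and the way locality is encoded are assumptions, not MVAPICH2-Virt internals.

```c
#include <stdio.h>
#include <string.h>

/* Simplified channel choices from the slide. */
typedef enum { CH_SMP, CH_IVSHMEM, CH_SRIOV } channel_t;

/* Hypothetical per-peer record filled in by the locality detector. */
typedef struct {
    char host[64];   /* physical host the peer's VM runs on */
    char vm[64];     /* VM identifier */
} peer_info_t;

/* Communication-coordinator sketch: same VM -> SMP (shared memory),
 * same host but different VM -> IVShmem, otherwise -> SR-IOV VF. */
static channel_t select_channel(const peer_info_t *self, const peer_info_t *peer)
{
    if (strcmp(self->host, peer->host) != 0)
        return CH_SRIOV;               /* remote node: go through the VF */
    if (strcmp(self->vm, peer->vm) == 0)
        return CH_SMP;                 /* co-located in the same VM */
    return CH_IVSHMEM;                 /* co-resident VMs on the same host */
}

int main(void)
{
    peer_info_t me   = { "node01", "vm0" };
    peer_info_t peer = { "node01", "vm1" };
    printf("channel = %d\n", select_channel(&me, &peer));  /* CH_IVSHMEM */
    return 0;
}
```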

Application Performance (NAS and P3DFFT)
• The proposed design delivers up to 43% (IS) improvement for NAS
• The proposed design brings 29%, 33%, 29%, and 20% improvement for INVERSE, RAND, SINE, and SPEC, respectively
[Charts: Execution time (s), SR-IOV vs. the proposed design. Left: NAS Class B (FT, LU, CG, MG, IS) on 32 VMs (8 VMs per node), up to 43% improvement for IS. Right: P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE) on 32 VMs (8 VMs per node), 20-33% improvement]

SR-IOV-enabled VM Migration Support on HPC Clouds

High-Performance SR-IOV-enabled VM Migration Framework for MPI Applications
[Figure: Migration framework. On each host, guest VMs run MPI processes over SR-IOV VFs and IVShmem; a controller interacts with a migration trigger, a read-to-migrate detector, a network suspend trigger, a migration-done detector, and a network reactivate notifier, while the hypervisor migrates the VM over the Ethernet adapter]
J. Zhang, X. Lu, and D. K. Panda, High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters, IPDPS 2017
• Two challenges:
  1. Detach/re-attach virtualized devices
  2. Maintain IB connections
• Challenge 1: multiple parallel libraries have to coordinate with the VM during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, track migration status)
• Challenge 2: the MPI runtime handles IB connection suspension and reactivation
• Proposed Progress Engine (PE)-based and Migration Thread (MT)-based designs to optimize VM migration and MPI application performance (see the sketch below)
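The following pthread-based sketch illustrates the migration-thread (MT) idea at a very high level: a helper thread waits for the controller's migration signal, suspends the IB channel, lets the hypervisor move the VM, and then reactivates the channel, while the main path keeps computing. All function and variable names are illustrative stand-ins, not the MVAPICH2 implementation.

```c
/* Compile with: gcc mt_sketch.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-ins for the real runtime hooks. */
static void suspend_ib_channel(void)      { puts("suspend IB connections (pre-migration)"); }
static void reactivate_ib_channel(void)   { puts("re-establish IB connections (post-migration)"); }
static bool migration_requested(void)     { static int n; return ++n == 3; }  /* fake trigger */
static void wait_for_migration_done(void) { sleep(1); /* hypervisor moves the VM */ }

static void *migration_thread(void *arg)
{
    (void)arg;
    while (!migration_requested())      /* poll the controller's signal */
        usleep(100 * 1000);

    suspend_ib_channel();               /* VF/IVShmem detach can happen now */
    wait_for_migration_done();
    reactivate_ib_channel();            /* resume communication on the target host */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, migration_thread, NULL);

    /* Main path keeps doing computation/MPI progress, overlapping the migration. */
    for (int i = 0; i < 5; i++) { printf("compute step %d\n", i); sleep(1); }

    pthread_join(tid, NULL);
    return 0;
}
```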

Proposed Design of MPI Runtime
[Figure: Timelines comparing the no-migration case, the progress-engine (PE)-based design, and the migration-thread (MT)-based design in its worst and typical scenarios. Legend: S = suspend channel, R = reactivate channel, dashed lines = control messages, lock/unlock communication, Comp = computation. Each timeline shows the pre-migration, ready-to-migrate, migration, and post-migration phases, highlighting migration signal detection and the down-time for VM live migration relative to computation and MPI calls]

Application Performance
• 8 VMs in total; 1 VM migrates while the application is running
• Compared with no migration (NM), MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation
[Charts: Execution time (s) for PE, MT-worst, MT-typical, and NM. Left: NAS (LU.B, EP.B, MG.B, CG.B, FT.B). Right: Graph500 (scale, edge factor: 20,10; 20,16; 20,20; 22,10)]

High-Performance MPI Communication for Nested Virtualization
[Figure: Two VMs on a two-socket node (NUMA 0 and NUMA 1, connected by QPI, each with its own memory controller and cores), each VM hosting two containers. A two-layer locality detector (container locality detector, VM locality detector, nested locality combiner) feeds a two-layer NUMA-aware communication coordinator that selects among the CMA, shared memory (SHM), and network (HCA) channels]
• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as those in the co-resident VMs (a minimal sketch of combining the two layers follows the citation below)
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel
J. Zhang, X. Lu, and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017
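As an illustration of the two-layer idea (not the actual detector), the sketch below combines a VM-level identifier with a container-level identifier into one nested-locality value for a peer; how the real runtime obtains these IDs at each layer is abstracted away behind hypothetical fields.

```c
#include <stdio.h>
#include <string.h>

/* Nested locality classes derived from the two layers. */
typedef enum {
    LOC_SAME_CONTAINER,   /* same container inside the same VM */
    LOC_SAME_VM,          /* co-resident containers in one VM */
    LOC_SAME_HOST,        /* co-resident VMs on one physical host */
    LOC_REMOTE            /* different physical hosts */
} nested_locality_t;

/* Hypothetical record produced by the container- and VM-level detectors. */
typedef struct {
    char host[64];
    char vm[64];
    char container[64];
} layer_ids_t;

/* Nested locality combiner: merge the two layers into one classification. */
static nested_locality_t combine_locality(const layer_ids_t *a, const layer_ids_t *b)
{
    if (strcmp(a->host, b->host) != 0)           return LOC_REMOTE;
    if (strcmp(a->vm, b->vm) != 0)               return LOC_SAME_HOST;
    if (strcmp(a->container, b->container) != 0) return LOC_SAME_VM;
    return LOC_SAME_CONTAINER;
}

int main(void)
{
    layer_ids_t p = { "node01", "vm0", "c0" }, q = { "node01", "vm0", "c1" };
    printf("%d\n", combine_locality(&p, &q));   /* LOC_SAME_VM */
    return 0;
}
```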

Two-Layer NUMA-Aware Communication Coordinator
[Figure: The coordinator combines a nested locality loader, a NUMA loader, and a message parser on top of the two-layer locality detector to choose among the CMA, shared memory (SHM), and network (HCA) channels]
• The Nested Locality Loader reads the locality information of the destination process from the Two-Layer Locality Detector
• The NUMA Loader reads VM/container placement information to decide on which NUMA node the destination process is pinned
• The Message Parser obtains message attributes, e.g., message type and message size (a simplified channel-selection sketch is shown below)
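Putting the three inputs together, a coordinator decision could look like the sketch below. The eager-size threshold, the channel set, and the structure fields are assumptions for illustration; the real coordinator described in the paper considers more message attributes.

```c
#include <stddef.h>
#include <stdio.h>

typedef enum { CH_CMA, CH_SHM, CH_HCA } channel_t;

typedef enum { LOC_SAME_CONTAINER, LOC_SAME_VM, LOC_SAME_HOST, LOC_REMOTE } nested_locality_t;

/* Hypothetical inputs gathered by the nested locality loader, NUMA loader,
 * and message parser. */
typedef struct {
    nested_locality_t locality;  /* from the two-layer locality detector */
    int same_numa_node;          /* 1 if src and dst are pinned to the same NUMA node */
    size_t msg_size;             /* from the message parser */
} coord_input_t;

#define EAGER_THRESHOLD (64 * 1024)   /* placeholder cutoff */

static channel_t coordinate(const coord_input_t *in)
{
    if (in->locality == LOC_REMOTE)
        return CH_HCA;                        /* cross-node: use the (virtual) HCA */

    /* Co-resident processes: in this simplified rule, large messages on the same
     * NUMA node use CMA (direct cross-memory copies); otherwise use the SHM channel. */
    if (in->msg_size >= EAGER_THRESHOLD && in->same_numa_node)
        return CH_CMA;

    return CH_SHM;
}

int main(void)
{
    coord_input_t in = { LOC_SAME_VM, 1, 128 * 1024 };
    printf("channel = %d\n", coordinate(&in));   /* CH_CMA under these assumptions */
    return 0;
}
```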

Application Performance
• 256 processes across 64 containers on 16 nodes
• Compared with the default design, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph500 and 10% (LU) for NAS
• Compared with the 1-layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
[Charts: Left: Class D NAS (IS, MG, EP, FT, CG, LU) execution time (s) for Default, 1Layer, and 2Layer-Enhanced-Hybrid. Right: Graph500 BFS execution time (s) for scale/edge-factor settings 22,20 through 28,16]

Typical Usage Scenarios
[Figure: Three ways MPI jobs map onto VMs on compute nodes: Exclusive Allocations, Sequential Jobs (EASJ); Exclusive Allocations, Concurrent Jobs (EACJ); and Shared-host Allocations, Concurrent Jobs (SACJ)]

Slurm-V Architecture Overview
[Figure: A job is submitted as an sbatch file plus a VM configuration file to SLURMctld, which sends the physical resource request and physical node list to the SLURMd daemons on the allocated nodes; each SLURMd launches and reclaims VMs (VM1, VM2, ...) through libvirtd, attaching an SR-IOV VF and an IVSHMEM device to each VM, with VM images served from a Lustre-based image pool]
VM launching/reclaiming covers:
1. SR-IOV virtual function
2. IVSHMEM device
3. Network setting
4. Image management
5. Launching VMs and checking availability
6. Mounting global storage, etc.
An illustrative libvirt-based launch sketch is shown below.
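For illustration only, the sketch below shows how a per-node agent could launch a VM through libvirtd from C, the way Slurm-V delegates VM launching. The domain XML is a heavily trimmed placeholder (a real configuration would also attach the SR-IOV VF as a hostdev entry and the IVSHMEM device), the image path is invented, and error handling is minimal.

```c
/* Sketch: launch a transient VM via libvirtd, as a Slurm-V node agent might.
 * Compile with: gcc launch_vm.c -lvirt
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

/* Placeholder XML; a real Slurm-V configuration would also contain a <hostdev>
 * entry for the SR-IOV VF and the ivshmem device. */
static const char *domain_xml =
    "<domain type='kvm'>"
    "  <name>slurmv-vm1</name>"
    "  <memory unit='GiB'>4</memory>"
    "  <vcpu>6</vcpu>"
    "  <os><type arch='x86_64'>hvm</type></os>"
    "  <devices>"
    "    <disk type='file' device='disk'>"
    "      <source file='/lustre/image-pool/centos7.qcow2'/>"
    "      <target dev='vda' bus='virtio'/>"
    "    </disk>"
    "  </devices>"
    "</domain>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "cannot connect to libvirtd\n"); return 1; }

    virDomainPtr dom = virDomainCreateXML(conn, domain_xml, 0);
    if (!dom) { fprintf(stderr, "VM launch failed\n"); virConnectClose(conn); return 1; }

    printf("launched VM: %s\n", virDomainGetName(dom));

    /* Reclaiming the VM when the job finishes would use virDomainDestroy(dom). */
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```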

Alternative Designs of Slurm-V
• Slurm SPANK plugin-based design
  – Utilize the SPANK plugin to read the VM configuration and launch/reclaim VMs
  – File-based lock to detect occupied VFs and exclusively allocate free VFs
  – Assign a unique ID to each IVSHMEM device and dynamically attach it to each VM
  – Inherits advantages from Slurm: coordination, scalability, security
• Slurm SPANK plugin over OpenStack-based design
  – Offload VM launch/reclaim to the underlying OpenStack framework
  – PCI whitelist to pass free VFs through to VMs
  – Extend Nova to enable IVSHMEM when launching VMs
  – Inherits advantages from both OpenStack and Slurm: component optimization, performance
A skeleton of the SPANK-plugin approach is sketched below.
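As a rough idea of the SPANK-plugin route (assumptions throughout: the hook choice and the vm_setup/vm_teardown helpers are placeholders, not the actual Slurm-V plugin), a skeleton could tie VM launch and reclaim to the job step like this:

```c
/* Skeleton of a SPANK plugin that would launch/reclaim VMs per job step.
 * Build as a shared object and list it in plugstack.conf.
 */
#include <slurm/spank.h>

SPANK_PLUGIN(slurm_v_sketch, 1)

/* Placeholder helpers: in Slurm-V these would read the VM configuration file,
 * grab a free VF under a file-based lock, attach the IVSHMEM device, and
 * drive libvirtd. */
static int vm_setup(void)    { slurm_info("slurm-v sketch: launching VM"); return 0; }
static int vm_teardown(void) { slurm_info("slurm-v sketch: reclaiming VM"); return 0; }

/* Called on the compute node when the job step starts. */
int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    (void)sp; (void)ac; (void)av;
    if (vm_setup() != 0) {
        slurm_error("slurm-v sketch: VM launch failed");
        return ESPANK_ERROR;
    }
    return ESPANK_SUCCESS;
}

/* Called when the job step finishes. */
int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    (void)sp; (void)ac; (void)av;
    vm_teardown();
    return ESPANK_SUCCESS;
}
```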

Application Performance
• 32 VMs across 8 nodes, 6 cores per VM
• EASJ: compared to native, less than 4% overhead
• SACJ, EACJ: less than 9% overhead when running NAS as a concurrent job with 64 processes
[Charts: Graph500 BFS execution time (ms), VM vs. Native, with 64 processes across 8 nodes on Chameleon, under EASJ (scale/edge-factor 24,16; 24,20; 26,10), SACJ (22,10; 22,16; 22,20), and EACJ (22,10; 22,16; 22,20)]

Impact on HPC and Cloud Communities
• Designs available through the MVAPICH2-Virt library: http://mvapich.cse.ohio-state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
• Complex appliances available on Chameleon Cloud
  – MPI bare-metal cluster: https://www.chameleoncloud.org/appliances/29/
  – MPI + SR-IOV KVM cluster: https://www.chameleoncloud.org/appliances/28/
• Enables users to easily and quickly deploy HPC clouds and run jobs with high performance
• Enables administrators to efficiently manage and schedule cluster resources

Conclusion
• Addresses key issues in building efficient HPC clouds
• Optimizes MPI communication on various HPC clouds
• Presents live migration designs to provide fault tolerance on HPC clouds
• Presents co-designs with resource management and scheduling systems
• Demonstrates the corresponding benefits on modern HPC clusters
• Broader outreach through MVAPICH2-Virt public releases and complex appliances on the Chameleon Cloud testbed

Thank You! & Questions?
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/