
Designing and Building Efficient HPC Cloud with Modern Networking Technologies on Heterogeneous HPC Clusters

Jie Zhang
Dr. Dhabaleswar K. Panda (Advisor)
Department of Computer Science & Engineering, The Ohio State University
SC 2017 Doctoral Showcase

Outline
• Introduction
• Problem Statement
• Detailed Designs and Results
• Impact on HPC Community
• Conclusion

Cloud Computing and Virtualization
• Cloud computing focuses on maximizing the effectiveness of shared resources
• Virtualization is the key technology behind it
• Widely adopted in industry computing environments
• IDC forecasts worldwide public IT cloud services spending will reach $195 billion by 2020 (Courtesy: http://www.idc.com/getdoc.jsp?containerId=prUS41669516)

Drivers of Modern HPC Cluster and Cloud Architecture
• Multi-/many-core technologies
• Accelerators (GPUs/co-processors)
• Large memory nodes (up to 2 TB)
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Single Root I/O Virtualization (SR-IOV)
[Figure: high-performance interconnects such as InfiniBand (with SR-IOV) offer <1 usec latency and 200 Gbps bandwidth; example systems include the SDSC Comet and TACC Stampede clouds]

Single Root I/O Virtualization (SR-IOV)
• SR-IOV provides new opportunities to design HPC clouds with very low overhead by bypassing the hypervisor
• Allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs)
• VFs are designed based on the existing non-virtualized PFs, so no driver change is needed
• Each VF can be dedicated to a single VM through PCI passthrough (see the verbs sketch below)
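Because a passed-through VF appears to the guest as an ordinary PCIe HCA, it can be probed with the standard libibverbs API. The following is a minimal sketch (not from the slides) that lists the RDMA devices visible inside a VM and opens the first one; the device names reported (e.g., mlx5_0) depend on the adapter.

```c
/* Minimal sketch: enumerate and open the SR-IOV VF inside a guest.
 * Compile with: gcc vf_probe.c -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "No RDMA devices found (is the VF attached?)\n");
        return EXIT_FAILURE;
    }

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(devs[i]));

    /* Open the first device; a VF behaves like a regular HCA here. */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "ibv_open_device failed\n");
        ibv_free_device_list(devs);
        return EXIT_FAILURE;
    }

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return EXIT_SUCCESS;
}
```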
4. Performance comparisons between IVShmem-backed and native-mode MPI libraries, using HPC applications
The evaluation results indicate that IVShmem can improve point-to-point and collective operations by up to 193% and 91%, respectively. Application execution time can be decreased by up to 96% compared to SR-IOV. The results further show that IVShmem introduces only small overheads compared with the native environment.
The rest of the paper is organized as follows. Section 2 provides an overview of IVShmem, SR-IOV, and InfiniBand. Section 3 describes our prototype design and evaluation methodology. Section 4 presents the performance analysis results using micro-benchmarks and applications, scalability results, and a comparison with native mode. We discuss related work in Section 5 and conclude in Section 6.
2 Background
Inter-VM Shared Memory (IVShmem) (e.g., Nahanni) [15] provides zero-copy access to data in shared memory of co-resident VMs on the KVM platform. IVShmem is designed and implemented mainly in the system call layer, and its interfaces are visible to user-space applications as well. As shown in Figure 2(a), IVShmem contains three components: the guest kernel driver, the modified QEMU supporting the PCI device, and the POSIX shared memory region on the host OS. The shared memory region is allocated by host POSIX operations and mapped into the QEMU process address space. The mapped memory in QEMU can be used by guest applications by being remapped to user space in the guest VMs; a minimal host-side sketch of setting up such a region is shown below. Evaluation results illustrate that both micro-benchmarks and HPC applications can achieve better performance with IVShmem support.
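The host-side region that ivshmem exports is just a POSIX shared memory object. The sketch below is an illustration, not code from the paper: the name /ivshmem_region and the 16 MB size are arbitrary placeholders, and QEMU would be pointed at the same object when the co-resident VMs are started.

```c
/* Sketch: create the host POSIX shared memory object that backs ivshmem.
 * Compile with: gcc shm_host.c -lrt
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/ivshmem_region"   /* placeholder name */
#define SHM_SIZE (16UL << 20)        /* 16 MB, placeholder size */

int main(void)
{
    /* Create (or open) the object under /dev/shm on the host. */
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return EXIT_FAILURE; }

    if (ftruncate(fd, SHM_SIZE) != 0) { perror("ftruncate"); return EXIT_FAILURE; }

    /* Map it; QEMU maps the same object and exposes it to guests as a PCI BAR. */
    void *base = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    strcpy((char *)base, "hello from the host");   /* visible to co-resident VMs */

    munmap(base, SHM_SIZE);
    close(fd);
    /* shm_unlink(SHM_NAME) would remove the object when it is no longer needed. */
    return EXIT_SUCCESS;
}
```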
Fig. 2. Overview of Inter-VM Shmem and SR-IOV communication mechanisms: (a) Inter-VM Shmem mechanism [15]; (b) SR-IOV mechanism [22]
Single Root I/O Virtualization (SR-IOV) is a PCI Express (PCIe) standard which specifies native I/O virtualization capabilities in PCIe adapters. As shown in Figure 2(b), SR-IOV allows a single physical device, or a Physical Function (PF), to present itself as multiple virtual devices, or Virtual Functions (VFs). Each virtual device can be dedicated to a single VM through PCI passthrough, which allows each VM to directly access the corresponding VF. Hence, SR-IOV is a hardware-based approach to I/O virtualization.

Does SR-IOV alone suffice to build efficient HPC clouds? No.
• It does not support locality-aware communication; co-located VMs still have to communicate through the SR-IOV channel
• It does not support VM migration, because of device passthrough
• It does not properly manage and isolate critical virtualized resources

Problem Statements
• Can the MPI runtime be redesigned to provide virtualization support for VMs/containers when building HPC clouds?
• How much benefit can be achieved on HPC clouds with the redesigned MPI runtime for scientific kernels and applications?
• Can fault tolerance/resilience (live migration) be supported on SR-IOV-enabled HPC clouds?
• Can we co-design with resource management and scheduling systems to enable HPC clouds on modern HPC systems?

Research Framework

MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 2,825 organizations in 85 countries
  – More than 432,000 (>0.4 million) downloads from the OSU site directly
  – Empowering many TOP500 clusters (Jul '17 ranking)
    • 1st-ranked 10,649,640-core cluster (Sunway TaihuLight) at NSC, Wuxi, China
    • 15th-ranked 241,108-core cluster (Pleiades) at NASA
    • 20th-ranked 522,080-core cluster (Stampede) at TACC
    • 44th-ranked 74,520-core cluster (Tsubame 2.5) at Tokyo Institute of Technology, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu

Locality-aware MPI Communication with SR-IOV and IVShmem
[Figure: MPI library architecture in native and virtualized environments. Native stack: Application, MPI layer, ADI3 layer, SMP channel (shared memory) and network channel (InfiniBand API) over native hardware. Virtualized stack: a VM-aware MPI library adds a locality detector and a communication coordinator that steer traffic over the SMP, IVShmem, and SR-IOV channels on virtualized hardware]
• The MPI library runs in both native and virtualized environments
• In the virtualized environment, the design:
  – supports shared-memory channels (SMP, IVShmem) and the SR-IOV channel
  – performs locality detection
  – performs communication coordination (a simplified channel-selection sketch follows the citation below)
  – applies communication optimizations on the different channels (SMP, IVShmem, SR-IOV; RC, UD)
J. Zhang, X. Lu, J. Jose, and D. K. Panda, High Performance MPI Library over SR-IOV Enabled InfiniBand Clusters, The International Conference on High Performance Computing (HiPC '14), Dec 2014
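To make the coordination concrete, here is a highly simplified sketch of how a VM-aware runtime might pick a channel per peer. It is illustrative only: the helper names (channel_t, peer_info_t) and the way locality is encoded are assumptions, not MVAPICH2-Virt internals.

```c
#include <stdio.h>
#include <string.h>

/* Simplified channel choices from the slide. */
typedef enum { CH_SMP, CH_IVSHMEM, CH_SRIOV } channel_t;

/* Hypothetical per-peer record filled in by the locality detector. */
typedef struct {
    char host[64];   /* physical host the peer's VM runs on */
    char vm[64];     /* VM identifier */
} peer_info_t;

/* Communication-coordinator sketch: same VM -> SMP (shared memory),
 * same host but different VM -> IVShmem, otherwise -> SR-IOV VF. */
static channel_t select_channel(const peer_info_t *self, const peer_info_t *peer)
{
    if (strcmp(self->host, peer->host) != 0)
        return CH_SRIOV;               /* remote node: go through the VF */
    if (strcmp(self->vm, peer->vm) == 0)
        return CH_SMP;                 /* co-located in the same VM */
    return CH_IVSHMEM;                 /* co-resident VMs on the same host */
}

int main(void)
{
    peer_info_t me   = { "node01", "vm0" };
    peer_info_t peer = { "node01", "vm1" };
    printf("channel = %d\n", select_channel(&me, &peer));  /* CH_IVSHMEM */
    return 0;
}
```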

Application Performance (NAS and P3DFFT)
• The proposed design delivers up to 43% (IS) improvement for NAS
• The proposed design brings 29%, 33%, 29%, and 20% improvement for INVERSE, RAND, SINE, and SPEC, respectively
[Charts: Execution time (s), SR-IOV vs. the proposed design. Left: NAS Class B (FT, LU, CG, MG, IS) on 32 VMs (8 VMs per node), up to 43% improvement for IS. Right: P3DFFT 512x512x512 (SPEC, INVERSE, RAND, SINE) on 32 VMs (8 VMs per node), 20-33% improvement]

SR-IOV-enabled VM Migration Support on HPC Clouds

High-Performance SR-IOV-enabled VM Migration Framework for MPI Applications
[Figure: Migration framework. On each host, guest VMs run MPI processes over SR-IOV VFs and IVShmem; a controller interacts with a migration trigger, a read-to-migrate detector, a network suspend trigger, a migration-done detector, and a network reactivate notifier, while the hypervisor migrates the VM over the Ethernet adapter]
J. Zhang, X. Lu, and D. K. Panda, High-Performance Virtual Machine Migration Framework for MPI Applications on SR-IOV Enabled InfiniBand Clusters, IPDPS 2017
• Two challenges:
  1. Detach/re-attach virtualized devices
  2. Maintain IB connections
• Challenge 1: multiple parallel libraries have to coordinate with the VM during migration (detach/re-attach SR-IOV/IVShmem, migrate VMs, track migration status)
• Challenge 2: the MPI runtime handles IB connection suspension and reactivation
• Proposed Progress Engine (PE)-based and Migration Thread (MT)-based designs to optimize VM migration and MPI application performance (see the sketch below)
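The following pthread-based sketch illustrates the migration-thread (MT) idea at a very high level: a helper thread waits for the controller's migration signal, suspends the IB channel, lets the hypervisor move the VM, and then reactivates the channel, while the main path keeps computing. All function and variable names are illustrative stand-ins, not the MVAPICH2 implementation.

```c
/* Compile with: gcc mt_sketch.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>
#include <unistd.h>

/* Stand-ins for the real runtime hooks. */
static void suspend_ib_channel(void)      { puts("suspend IB connections (pre-migration)"); }
static void reactivate_ib_channel(void)   { puts("re-establish IB connections (post-migration)"); }
static bool migration_requested(void)     { static int n; return ++n == 3; }  /* fake trigger */
static void wait_for_migration_done(void) { sleep(1); /* hypervisor moves the VM */ }

static void *migration_thread(void *arg)
{
    (void)arg;
    while (!migration_requested())      /* poll the controller's signal */
        usleep(100 * 1000);

    suspend_ib_channel();               /* VF/IVShmem detach can happen now */
    wait_for_migration_done();
    reactivate_ib_channel();            /* resume communication on the target host */
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, migration_thread, NULL);

    /* Main path keeps doing computation/MPI progress, overlapping the migration. */
    for (int i = 0; i < 5; i++) { printf("compute step %d\n", i); sleep(1); }

    pthread_join(tid, NULL);
    return 0;
}
```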

Proposed Design of MPI Runtime
[Figure: Timelines comparing the no-migration case, the progress-engine (PE)-based design, and the migration-thread (MT)-based design in its worst and typical scenarios. Legend: S = suspend channel, R = reactivate channel, dashed lines = control messages, lock/unlock communication, Comp = computation. Each timeline shows the pre-migration, ready-to-migrate, migration, and post-migration phases, highlighting migration signal detection and the down-time for VM live migration relative to computation and MPI calls]

Application Performance
• 8 VMs in total; 1 VM migrates while the application is running
• Compared with no migration (NM), MT-worst and PE incur some overhead
• MT-typical allows migration to be completely overlapped with computation
[Charts: Execution time (s) for PE, MT-worst, MT-typical, and NM. Left: NAS (LU.B, EP.B, MG.B, CG.B, FT.B). Right: Graph500 (scale, edge factor: 20,10; 20,16; 20,20; 22,10)]

High-Performance MPI Communication for Nested Virtualization
[Figure: Two VMs on a two-socket node (NUMA 0 and NUMA 1, connected by QPI, each with its own memory controller and cores), each VM hosting two containers. A two-layer locality detector (container locality detector, VM locality detector, nested locality combiner) feeds a two-layer NUMA-aware communication coordinator that selects among the CMA, shared memory (SHM), and network (HCA) channels]
• Two-Layer Locality Detector: dynamically detects MPI processes in the co-resident containers inside one VM as well as those in the co-resident VMs (a minimal sketch of combining the two layers follows the citation below)
• Two-Layer NUMA-Aware Communication Coordinator: leverages nested locality information, NUMA architecture information, and message characteristics to select the appropriate communication channel
J. Zhang, X. Lu, and D. K. Panda, Designing Locality and NUMA Aware MPI Runtime for Nested Virtualization based HPC Cloud with SR-IOV Enabled InfiniBand, The 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '17), April 2017
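As an illustration of the two-layer idea (not the actual detector), the sketch below combines a VM-level identifier with a container-level identifier into one nested-locality value for a peer; how the real runtime obtains these IDs at each layer is abstracted away behind hypothetical fields.

```c
#include <stdio.h>
#include <string.h>

/* Nested locality classes derived from the two layers. */
typedef enum {
    LOC_SAME_CONTAINER,   /* same container inside the same VM */
    LOC_SAME_VM,          /* co-resident containers in one VM */
    LOC_SAME_HOST,        /* co-resident VMs on one physical host */
    LOC_REMOTE            /* different physical hosts */
} nested_locality_t;

/* Hypothetical record produced by the container- and VM-level detectors. */
typedef struct {
    char host[64];
    char vm[64];
    char container[64];
} layer_ids_t;

/* Nested locality combiner: merge the two layers into one classification. */
static nested_locality_t combine_locality(const layer_ids_t *a, const layer_ids_t *b)
{
    if (strcmp(a->host, b->host) != 0)           return LOC_REMOTE;
    if (strcmp(a->vm, b->vm) != 0)               return LOC_SAME_HOST;
    if (strcmp(a->container, b->container) != 0) return LOC_SAME_VM;
    return LOC_SAME_CONTAINER;
}

int main(void)
{
    layer_ids_t p = { "node01", "vm0", "c0" }, q = { "node01", "vm0", "c1" };
    printf("%d\n", combine_locality(&p, &q));   /* LOC_SAME_VM */
    return 0;
}
```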

Two-Layer NUMA-Aware Communication Coordinator
[Figure: The coordinator combines a nested locality loader, a NUMA loader, and a message parser on top of the two-layer locality detector to choose among the CMA, shared memory (SHM), and network (HCA) channels]
• The Nested Locality Loader reads the locality information of the destination process from the Two-Layer Locality Detector
• The NUMA Loader reads VM/container placement information to decide on which NUMA node the destination process is pinned
• The Message Parser obtains message attributes, e.g., message type and message size (a simplified channel-selection sketch is shown below)
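Putting the three inputs together, a coordinator decision could look like the sketch below. The eager-size threshold, the channel set, and the structure fields are assumptions for illustration; the real coordinator described in the paper considers more message attributes.

```c
#include <stddef.h>
#include <stdio.h>

typedef enum { CH_CMA, CH_SHM, CH_HCA } channel_t;

typedef enum { LOC_SAME_CONTAINER, LOC_SAME_VM, LOC_SAME_HOST, LOC_REMOTE } nested_locality_t;

/* Hypothetical inputs gathered by the nested locality loader, NUMA loader,
 * and message parser. */
typedef struct {
    nested_locality_t locality;  /* from the two-layer locality detector */
    int same_numa_node;          /* 1 if src and dst are pinned to the same NUMA node */
    size_t msg_size;             /* from the message parser */
} coord_input_t;

#define EAGER_THRESHOLD (64 * 1024)   /* placeholder cutoff */

static channel_t coordinate(const coord_input_t *in)
{
    if (in->locality == LOC_REMOTE)
        return CH_HCA;                        /* cross-node: use the (virtual) HCA */

    /* Co-resident processes: in this simplified rule, large messages on the same
     * NUMA node use CMA (direct cross-memory copies); otherwise use the SHM channel. */
    if (in->msg_size >= EAGER_THRESHOLD && in->same_numa_node)
        return CH_CMA;

    return CH_SHM;
}

int main(void)
{
    coord_input_t in = { LOC_SAME_VM, 1, 128 * 1024 };
    printf("channel = %d\n", coordinate(&in));   /* CH_CMA under these assumptions */
    return 0;
}
```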

Application Performance
• 256 processes across 64 containers on 16 nodes
• Compared with the default design, the enhanced-hybrid design reduces execution time by up to 16% (28,16) for Graph500 and 10% (LU) for NAS
• Compared with the 1-layer case, the enhanced-hybrid design also brings up to 12% (28,16) and 6% (LU) performance benefit
[Charts: Left: Class D NAS (IS, MG, EP, FT, CG, LU) execution time (s) for Default, 1Layer, and 2Layer-Enhanced-Hybrid. Right: Graph500 BFS execution time (s) for scale/edge-factor settings 22,20 through 28,16]

Typical Usage Scenarios
[Figure: Three ways MPI jobs map onto VMs on compute nodes: Exclusive Allocations, Sequential Jobs (EASJ); Exclusive Allocations, Concurrent Jobs (EACJ); and Shared-host Allocations, Concurrent Jobs (SACJ)]

Slurm-V Architecture Overview
[Figure: A job is submitted as an sbatch file plus a VM configuration file to SLURMctld, which sends the physical resource request and physical node list to the SLURMd daemons on the allocated nodes; each SLURMd launches and reclaims VMs (VM1, VM2, ...) through libvirtd, attaching an SR-IOV VF and an IVSHMEM device to each VM, with VM images served from a Lustre-based image pool]
VM launching/reclaiming covers:
1. SR-IOV virtual function
2. IVSHMEM device
3. Network setting
4. Image management
5. Launching VMs and checking availability
6. Mounting global storage, etc.
An illustrative libvirt-based launch sketch is shown below.
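For illustration only, the sketch below shows how a per-node agent could launch a VM through libvirtd from C, the way Slurm-V delegates VM launching. The domain XML is a heavily trimmed placeholder (a real configuration would also attach the SR-IOV VF as a hostdev entry and the IVSHMEM device), the image path is invented, and error handling is minimal.

```c
/* Sketch: launch a transient VM via libvirtd, as a Slurm-V node agent might.
 * Compile with: gcc launch_vm.c -lvirt
 */
#include <stdio.h>
#include <libvirt/libvirt.h>

/* Placeholder XML; a real Slurm-V configuration would also contain a <hostdev>
 * entry for the SR-IOV VF and the ivshmem device. */
static const char *domain_xml =
    "<domain type='kvm'>"
    "  <name>slurmv-vm1</name>"
    "  <memory unit='GiB'>4</memory>"
    "  <vcpu>6</vcpu>"
    "  <os><type arch='x86_64'>hvm</type></os>"
    "  <devices>"
    "    <disk type='file' device='disk'>"
    "      <source file='/lustre/image-pool/centos7.qcow2'/>"
    "      <target dev='vda' bus='virtio'/>"
    "    </disk>"
    "  </devices>"
    "</domain>";

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn) { fprintf(stderr, "cannot connect to libvirtd\n"); return 1; }

    virDomainPtr dom = virDomainCreateXML(conn, domain_xml, 0);
    if (!dom) { fprintf(stderr, "VM launch failed\n"); virConnectClose(conn); return 1; }

    printf("launched VM: %s\n", virDomainGetName(dom));

    /* Reclaiming the VM when the job finishes would use virDomainDestroy(dom). */
    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
```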

Alternative Designs of Slurm-V
• Slurm SPANK plugin-based design
  – Utilize the SPANK plugin to read the VM configuration and launch/reclaim VMs
  – File-based lock to detect occupied VFs and exclusively allocate free VFs
  – Assign a unique ID to each IVSHMEM device and dynamically attach it to each VM
  – Inherits advantages from Slurm: coordination, scalability, security
• Slurm SPANK plugin over OpenStack-based design
  – Offload VM launch/reclaim to the underlying OpenStack framework
  – PCI whitelist to pass free VFs through to VMs
  – Extend Nova to enable IVSHMEM when launching VMs
  – Inherits advantages from both OpenStack and Slurm: component optimization, performance
A skeleton of the SPANK-plugin approach is sketched below.
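As a rough idea of the SPANK-plugin route (assumptions throughout: the hook choice and the vm_setup/vm_teardown helpers are placeholders, not the actual Slurm-V plugin), a skeleton could tie VM launch and reclaim to the job step like this:

```c
/* Skeleton of a SPANK plugin that would launch/reclaim VMs per job step.
 * Build as a shared object and list it in plugstack.conf.
 */
#include <slurm/spank.h>

SPANK_PLUGIN(slurm_v_sketch, 1)

/* Placeholder helpers: in Slurm-V these would read the VM configuration file,
 * grab a free VF under a file-based lock, attach the IVSHMEM device, and
 * drive libvirtd. */
static int vm_setup(void)    { slurm_info("slurm-v sketch: launching VM"); return 0; }
static int vm_teardown(void) { slurm_info("slurm-v sketch: reclaiming VM"); return 0; }

/* Called on the compute node when the job step starts. */
int slurm_spank_user_init(spank_t sp, int ac, char **av)
{
    (void)sp; (void)ac; (void)av;
    if (vm_setup() != 0) {
        slurm_error("slurm-v sketch: VM launch failed");
        return ESPANK_ERROR;
    }
    return ESPANK_SUCCESS;
}

/* Called when the job step finishes. */
int slurm_spank_exit(spank_t sp, int ac, char **av)
{
    (void)sp; (void)ac; (void)av;
    vm_teardown();
    return ESPANK_SUCCESS;
}
```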

Application Performance
• 32 VMs across 8 nodes, 6 cores per VM
• EASJ: compared to native, less than 4% overhead
• SACJ, EACJ: less than 9% overhead when running NAS as a concurrent job with 64 processes
[Charts: Graph500 BFS execution time (ms), VM vs. Native, with 64 processes across 8 nodes on Chameleon, under EASJ (scale/edge-factor 24,16; 24,20; 26,10), SACJ (22,10; 22,16; 22,20), and EACJ (22,10; 22,16; 22,20)]

Impact on HPC and Cloud Communities
• Designs available through the MVAPICH2-Virt library: http://mvapich.cse.ohio-state.edu/download/mvapich/virt/mvapich2-virt-2.2-1.el7.centos.x86_64.rpm
• Complex appliances available on Chameleon Cloud
  – MPI bare-metal cluster: https://www.chameleoncloud.org/appliances/29/
  – MPI + SR-IOV KVM cluster: https://www.chameleoncloud.org/appliances/28/
• Enables users to easily and quickly deploy HPC clouds and run jobs with high performance
• Enables administrators to efficiently manage and schedule cluster resources

Conclusion
• Addresses key issues in building efficient HPC clouds
• Optimizes MPI communication on various HPC clouds
• Presents live migration designs to provide fault tolerance on HPC clouds
• Presents co-designs with resource management and scheduling systems
• Demonstrates the corresponding benefits on modern HPC clusters
• Broader outreach through MVAPICH2-Virt public releases and complex appliances on the Chameleon Cloud testbed

Thank You! & Questions?
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/