SIGNALS - Prasad V. Potluri Siddhartha Institute of Technology
NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer,...
Transcript of NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer,...
![Page 1: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/1.jpg)
Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA
S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY FOR NVIDIA GPU CLUSTERS
![Page 2: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/2.jpg)
2
AGENDA
GPU Programming Models
Overview of NVSHMEM
Porting to NVSHMEM
Future Work
Conclusion and Future Work
![Page 3: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/3.jpg)
3
Compute on GPU
Communication from CPU
Synchronization at boundaries
Offload latencies in critical path
Hiding increases code complexity
GPU FOR COMPUTE OFFLOAD
cuda_kernel<<<>>>
GPU CPU PCIe/Network
MPI_Isend
MPI_Wait
cudaStreamSynchronize<<<>>>
![Page 4: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/4.jpg)
4
GPU MASTERS COMMUNICATION
cuda_kernel<<<>>>
GPUCPU Network
cudaStreamSynchronize<<<>>>Avoids offload latencies
Compute – communication overlap
Easier to express algorithms with inline communication
Improving performance while making it easier to program
shmem_put
shmem_quiet
shmem_put
shmem_put
shmem_put
shmem_put
![Page 5: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/5.jpg)
5
AGENDA
GPU Programming Models
Overview of NVSHMEM
Porting to NVSHMEM
Future Work
Conclusion and Future Work
![Page 6: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/6.jpg)
6
WHAT IS NVSHMEM ?
Experimental implementation of OpenSHMEM for NVIDIA GPUs, 1 PE/GPU
shared memory: shmem_malloc
private memory: cudaMalloc
shmem communication APIs: shared->shared or private->shared
![Page 7: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/7.jpg)
7
DEVICE-INITIATED COMMUNICATION
PE i
PE i-1
PE i+1
Thread-level communication APIs
Allow finer grained control and overlap
Maps well onto NVLink fabric – DGX-1/DGX-2
__global__ void stencil_single_step (float *u, …)
{
int ix = threadIdx.x, iy = threadIdx.y;
//compute
//data exchange
if (iy == ny) {
shmem_float_p (u + ny*nx + ix, u + ix, top_pe);
}
if (iy == 1) {
shmem_float_p (u + nx + ix, u + (ny+1)*nx + ix, bottom_pe);
}
}
ny
nx
![Page 8: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/8.jpg)
8
THREAD-GROUP COMMUNICATION
PE i
PE i-1
PE i+1
Operations can be issued by a WARP/CTA
Coarser, hence more efficient transfers over networks like IB
Still allows inter-warp/inter-block overlap
__global__ void stencil_single_step (u, …)
{
//compute
//data exchange
shmem_float_put_block_nbi (u + ny*nx, u, nx, top_pe);
shmem_float_put_block_nbi (u + nx, u + (ny+1)*nx, nx, bottom_pe);
}
![Page 9: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/9.jpg)
9
IN-KERNEL SYNCHRONIZATION
Allows inter-PE synchronization
Can offload larger portions of application running CUDA kernels
PE i
PE i-1
PE i+1
Data transfer +Synchronization
+
__global__ void stencil_uber (u, …)
{
while (iter=0; iter<N; iter++) {
//compute
//data exchange
shmem_float_put_nbi_block (u + ny*nx, u, nx, top_pe);
shmem_float_put_nbi_block (u + nx, u + (ny+1)*nx, nx, bottom_pe);
shmem_barrier_all();
}
}
![Page 10: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/10.jpg)
10
COLLECTIVE KERNEL LAUNCH
GPUs supported CUDA kernel launch
Device-Initiated
CommunicationKepler or newer
regular <<<>>> or
launch APIs
Device-Initiated
SynchronizationVolta or newer shmemx_collective_launch
Provides progress when using device-side inter-kernel synchronization
Built on CUDA cooperative launch and requirement of 1PE/GPU
![Page 11: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/11.jpg)
11
STREAM-ORDERED OPERATIONS
Not optimal to move all communication/synchronization into CUDA kernels
Inter-CTA synchronization latencies can be longer than kernel launch latencies
Allows mixing fine-grained communication + coarse-grained synchronization
GPU 0 GPU 0PCIe/Network
shmem_barrier_all_on_stream
![Page 12: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/12.jpg)
12
INTRA-NODE IMPLEMENTATION
NVLink or PCIe
uses CUDA IPC under the hood
shmem_put/get on device
ld/store
shmem_put/get_on_stream
cudaMemcpyAsync
GPU 0
VirtualAddress
PhysicalAddress
GPU 1 GPU 2
VirtualAddress
PhysicalAddress
VirtualAddress
PhysicalAddress
![Page 13: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/13.jpg)
13
MULTI-NODE SUPPORT
Reverse offloads network transfers to the CPU
Avoids memory fences when signaling CPU
Uses standard IB verbs (Mellanox OFED for GPUDirect RDMA)
GPU
CPU Network
Proxy
ring-buffer IB QP
![Page 14: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/14.jpg)
14
NVSHMEM STATUS
Research vehicle for designing and evaluating GPU-centric workloads
Early access (EA2) available – please reach out to [email protected]
Main Features
NVLink and PCIe support
InfiniBand support (new)
X86 and Power9 (new) support
Interoperability with MPI and OpenSHMEM (new) libraries
![Page 15: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/15.jpg)
15
AGENDA
GPU Programming Models
Overview of NVSHMEM
Porting to NVSHMEM
Future Work
Conclusion and Future Work
![Page 16: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/16.jpg)
16
PORTING TO USE NVSHMEM FROM GPU
Step I : Only communication from inside the kernel (on Kepler or newer GPUs)
Step II : Both communication and synchronization from inside the kernel (on Pascal or newer Tesla GPUs)
Using Jacobi Solver, we will walk through I and II and compare with MPI version
Code available at : github.com/NVIDIA/multi-gpu-programming-models
GTC 2019 S9139 Multi-GPU Programming Models, Jiri Kraus - Senior Devtech Compute, NVIDIA
![Page 17: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/17.jpg)
17
EXAMPLE: JACOBI SOLVER
While not converged
Do Jacobi step:
for( int iy = 1; iy < ny-1; iy++ )
for( int ix = 1; ix < nx-1; ix++ )
a_new[iy*nx+ix] = -0.25 *
-( a[ iy *nx+(ix+1)] + a[ iy *nx+ix-1]
+ a[(iy-1)*nx+ ix ] + a[(iy+1)*nx+ix ] );
Apply periodic boundary conditions
Swap a_new and a
Next iteration
![Page 18: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/18.jpg)
18
COMPUTE KERNEL – SINGLE GPU
__global__ void jacobi_kernel( ... ) {
const int ix = bIdx.x*bDim.x+tIdx.x;
const int iy = bIdx.y*bDim.y+tIdx.y + iy_start;
real local_l2_norm = 0.0;
if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
+ a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
a_new[ iy * nx + ix ] = new_val;
real residue = new_val - a[ iy * nx + ix ];
local_l2_norm += residue * residue;
}
atomicAdd( l2_norm, local_l2_norm ); }}
}
github.com/NVIDIA/multi-gpu-programming-models/tree/master/single_gpu
![Page 19: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/19.jpg)
19
HOST CODE - MPItop_stream bottom_stream compute_stream
cudaMemsetAsync(norm)
cudaRecordEvent (event0)
cudaStreamWaitEvent(event0)
compute_jacobi<<>>>(bottom_boundary)
cudaStreamWaitEvent(event0)
compute_jacobi<<>>>(top_boundary)
compute_jacobi<<>>>(interior)
cudaStreamSynchronize() cudaStreamSynchronize()
MPI_SendRecv(top) MPI_SendRecv(bottom)
cudaRecordEvent (event1)cudaRecordEvent (event2)
cudaStreamWaitEvent(event1)
cudaStreamWaitEvent(event2)
cudaMemcpyAsync(norm)
MPI_Allreduce(norm)
Once every n
iterations
![Page 20: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/20.jpg)
20
HOST CODE - MPIwhile (iter < iter_max ) {while ( l2_norm > tol && iter < iter_max )
{
//reset norm
CUDA_RT_CALL( cudaMemsetAsync(l2_norm_d, 0 , sizeof(real), compute_stream ) );
CUDA_RT_CALL( cudaEventRecord( reset_l2norm_done, compute_stream ) );
//compute boundary
CUDA_RT_CALL( cudaStreamWaitEvent( push_top_stream, reset_l2norm_done, 0 ) );
launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, (iy_start+1), nx, push_top_stream );
CUDA_RT_CALL( cudaEventRecord( push_top_done, push_top_stream ) )
CUDA_RT_CALL( cudaStreamWaitEvent( push_bottom_stream, reset_l2norm_done, 0 ) );
launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_end-1), iy_end, nx, push_bottom_stream );
CUDA_RT_CALL( cudaEventRecord( push_bottom_done, push_bottom_stream ) );
//compute interior
launch_jacobi_kernel( a_new, a, l2_norm_d, (iy_start+1), (iy_end-1), nx, compute_stream );
//Apply periodic boundary conditions
CUDA_RT_CALL( cudaStreamSynchronize( push_top_stream ) );
MPI_CALL( MPI_Sendrecv( a_new+iy_start*nx, nx, MPI_REAL_TYPE, top , 0, a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
CUDA_RT_CALL( cudaStreamSynchronize( push_bottom_stream ) );
MPI_CALL( MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0, a_new, nx, MPI_REAL_TYPE, top, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
//Periodic convergence check
if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
CUDA_RT_CALL( cudaStreamWaitEvent( compute_stream, push_top_done, 0 ) );
CUDA_RT_CALL( cudaStreamWaitEvent( compute_stream, push_bottom_done, 0 ) );
CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream ) );
CUDA_RT_CALL( cudaStreamSynchronize( compute_stream ) );
MPI_CALL( MPI_Allreduce( l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD ) );
l2_norm = std::sqrt( l2_norm );
}
github.com/NVIDIA/multi-gpu-programming-models/tree/master/mpi_overlapp
![Page 21: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/21.jpg)
21
CUDA KERNEL - NVSHMEM FOR COMMS
__global__ void jacobi_kernel( ... ) {
const int ix = bIdx.x*bDim.x+tIdx.x;
const int iy = bIdx.y*bDim.y+tIdx.y + iy_start;
real local_l2_norm = 0.0;
if ( iy < iy_end && ix >= 1 && ix < (nx-1) ) {
const real new_val = 0.25 * ( a[ iy * nx + ix + 1 ] + a[ iy * nx + ix - 1 ]
+ a[ (iy+1) * nx + ix ] + a[ (iy-1) * nx + ix ] );
a_new[ iy * nx + ix ] = new_val;
if ( iy_start == iy )
shmem_float_p(a_new + top_iy*nx + ix, new_val, top_pe);
if ( iy_end == iy )
shmem_float_p(a_new + bottom_iy*nx + ix, new_val, bottom_pe);
real residue = new_val - a[ iy * nx + ix ];
}
atomicAdd( l2_norm, local_l2_norm ); }}
}
github.com/NVIDIA/multi-gpu-programming-models/tree/master/nvshmem
![Page 22: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/22.jpg)
22
HOST CODE - NVSHMEM FOR COMMS
a = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
a_new = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
…
while (iter < iter_max && l2_norm > tol ) {
…
jacobi_kernel<<<dim_grid,dim_block,0,compute_stream>>>(
a_new, a, l2_norm_d, iy_start, iy_end, nx,
top, iy_end_top, bottom, iy_start_bottom );
shmemx_barrier_all_on_stream(compute_stream);
//convergence check
if ((iter % nccheck) == 0) {
cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream ) );
cudaStreamSynchronize( compute_stream );
MPI_Allreduce( l2_norm_h, &l2_norm, 1, MPI_REAL_TYPE, MPI_SUM, MPI_COMM_WORLD );
l2_norm = std::sqrt( l2_norm );
}
}
…
github.com/NVIDIA/multi-gpu-programming-models/tree/master/nvshmem
![Page 23: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/23.jpg)
23
__global__ void jacobi_uber_kernel( ... ) {
grid_group g = this_grid();
//comms only ...
g.sync();
if ( (iter % nccheck) == 0 ) {
//reduction across shmem pes
if (!tid_x && !tid_y) {
shmem_barrier_all();
shmem_float_sum_to_all (l2_norm + 1, l2_norm, 1, 0, 0, npes, NULL, NULL);
l2_norm[1] = (float) __frsqrt_rn( (float)l2_norm[1] );
l2_norm[0] = 0;}}
g.sync();
}
CUDA KERNEL - NVSHMEM FOR COMMS + SYNC
![Page 24: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/24.jpg)
24
a = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
a_new = (real *) shmem_malloc(nx*(chunk_size+2)*sizeof(real));
…
void *args[] = {&a_new, &a, &l2_norm_d, &iy_start, &iy_end, &nx,
&top, &iy_end_top, &bottom, &iy_start_bottom};
shmemx_collective_launch ( jacobi_kernel, dim_grid,
dim_block, 0, compute_stream);
…
HOST CODE - NVSHMEM FOR COMMS + SYNC
![Page 25: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/25.jpg)
25
JACOBI SOLVER (ON X86)
0.00%
10.00%
20.00%
30.00%
40.00%
50.00%
60.00%
70.00%
80.00%
90.00%
100.00%
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
Para
llel Eff
icie
ncy
#GPUs
MPI Overlap NVSHMEM
DGX-2 (1 node) - 18432 x 18432, 1000 iterations
Benchmarksetup: DGX-2 with OS 4.0.5, GCC 7.3.0, CUDA 10.0 with 410.104 Driver, CUB 1.8.0, CUDA-aware OpenMPI 4.0.0, NVSHMEM EA2 (0.2.3),GPUs@1597Mhz AC, Reported Runtime is the minimum of 5 repetitions
![Page 26: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/26.jpg)
Multi-GPU transpose : USING NVSHMEMJACOBI SOLVER (ON POWER9)
4 GP100s connected with NVLink 0.0x
0.5x
1.0x
1.5x
2.0x
2.5x
3.0x
3.5x
4.0x
4.5x
5.0x
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
1 6 12 24 48 96 192 384
Par
alle
l Eff
icie
ncy
# GPUs
mpi
nvshmem
ratio
Summit (64 nodes) - 16384 x 16384, 1000 iterations
Benchmarksetup: Summit with RHEL 4.14.0, GCC 4.8.5, CUDA 9.2.148 with 396.64 Driver, CUB 1.8.0, CUDA-aware OpenMPI 4.0.0, NVSHMEM EA2 (0.2.3)
![Page 27: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/27.jpg)
27
AGENDA
GPU Programming Models
Overview of NVSHMEM
Porting to NVSHMEM
Future Work
Conclusion and Future Work
![Page 28: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/28.jpg)
28
IN-HEADER IMPLEMENTATION
Put/Get translates to LD/ST on NVLink/PCIe
Each API results in
• Function call overhead
• Remote address recalculation
Avoid through in-header implementation0
10
20
30
40
50
60
GB
/s
Bytes
shmem_p_bw : implementation in library
shmem_p_bw : implementation in header
shmem_st_bw : application does store to memory
Benchmarksetup: DGX-1 with OS 4.14.0, GCC 5.4.0, CUDA 10.0.130 with 418.39 Driver
![Page 29: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/29.jpg)
29
COLLECTIVES OPTIMIZATION
Now use direct all-to-all communication
Works well over NVlink in a DGX-1/2
Limits scalability inter-node
Improving implementations for device-side ops
Leverage NCCL goodness for CPU/on-stream ops 0x
1x
2x
3x
4x
5x
6x
7x
8x
9x
0
50
100
150
200
250
300
350
400
450
4 8 16 32 64 128 256
Tim
e (
mic
rose
con
ds)
# GPUs
Alltoall (EA2)
Dissemination
Ratio
Benchmark setup: Summit with RHEL 4.14.0, GCC 4.8.5, CUDA 9.2.148 with 396.64 Driver
Barrier Latency
![Page 30: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/30.jpg)
30
OTHERS
Interoperability
Tied to OpenMPI for MPI interoperability
To make MPI-based bootstrap an external module
Strided transfers – multi-dimensional decomposition
![Page 31: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/31.jpg)
31
SUMMARY
Allows design and experimentation with GPU-initiated communication
Early access (EA2) available
Support for P2P and Infiniband connected GPUs
x86 and P9 support
Interoperable with MPI and OpenSHMEM
Please reach out to [email protected]
application use case, how GPU-initiated may help
![Page 32: NVSHMEM: A Partitioned Global Address Space Library for ... · Anshuman Goswami, Akhil Langer, Sreeram Potluri, NVIDIA S9677 - NVSHMEM: A PARTITIONED GLOBAL ADDRESS SPACE LIBRARY](https://reader034.fdocuments.us/reader034/viewer/2022042107/5e871baeae24a205fb18abce/html5/thumbnails/32.jpg)