Mark Harris, November 1, 2017 - GPU Technology...
May 8-11, 2017 | Silicon Valley
Mark Harris, November 1, 2017
CUDA 9 AND BEYOND
INTRODUCING CUDA 9
BUILT FOR VOLTA
Tesla V100
New GPU Architecture
Tensor Cores
NVLink
Independent Thread Scheduling
COOPERATIVE THREAD GROUPS
Flexible Thread Groups
Efficient Parallel Algorithms
Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs
FASTER LIBRARIES
cuBLAS for Deep Learning
NPP for Image Processing
cuFFT for Signal Processing
DEVELOPER TOOLS & PLATFORM UPDATES
Faster Compile Times
Unified Memory Profiling
NVLink Visualization
New OS and Compiler Support
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC
Volta Architecture: Most Productive GPU
Tensor Core: 125 Programmable TFLOPS Deep Learning
Improved SIMT Model: New Algorithms
Volta MPS: Inference Utilization
Improved NVLink & HBM2: Efficient Bandwidth
ROAD TO EXASCALE
Volta to Fuel Most Powerful US Supercomputers
1.5x HPC Performance in 1 Year
[Chart: V100 performance relative to P100; plotted speedups range from 1.37x to 1.7x]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 2X Tesla P100 or V100.
Summit Supercomputer
200+ PetaFlops
~3,400 Nodes
10 Megawatts
FASTER LIBRARIES
CUDA 9: WHAT’S NEW IN LIBRARIES
VOLTA PLATFORM SUPPORT
Utilize Volta Tensor Cores
Volta optimized GEMMs (cuBLAS)
Out-of-box performance on Volta (all libraries)
PERFORMANCE
GEMM optimizations for RNNs (cuBLAS)
Faster image processing (NPP)
FFT optimizations across various sizes (cuFFT)
NEW ALGORITHMS
Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
Breadth first search, clustering, triangle counting, extraction & contraction (nvGRAPH)
IMPROVED USER EXPERIENCE
New install package for CUDA Libraries (library-only meta package)
Modular NPP with small footprint, support for image batching
DEEP LEARNING
Scientific Computing
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Chart: cuBLAS Mixed Precision (FP16 Input, FP32 compute); relative performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096); P100 (CUDA 8) vs. V100 Tensor Cores (CUDA 9); peak speedup 9.3x]
[Chart: cuBLAS Single Precision (FP32); relative performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096); P100 (CUDA 8) vs. V100 (CUDA 9); peak speedup 1.8x]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
COOPERATIVE GROUPS
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication
Define, synchronize, and partition groups of cooperating threads
Clean composition across software boundaries
Optimize for hardware fast path
Scalable from a few threads to all running threads
Deploy Everywhere: Kepler and Newer GPUs
Supported by CUDA developer tools
[Figure: a thread block group and partitioned thread groups]
SYNCHRONIZE AT ANY SCALE
Three Key Capabilities
FLEXIBLE GROUPS
Define and Synchronize Arbitrary Groups of Threads
WHOLE-GRID SYNCHRONIZATION
Synchronize Multiple Thread Blocks
MULTI-GPU SYNCHRONIZATION
* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization
Thread groups are explicit objects in your program
    thread_group block = this_thread_block();
You can synchronize threads in a group
    block.sync();
Create new groups by partitioning existing groups
    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4 = tiled_partition(tile32, 4);
Partitioned groups can also synchronize
    tile4.sync();
Note: thread_group, this_thread_block(), and tiled_partition() are part of the cooperative_groups:: namespace (shown in green on the slide)
EXAMPLE: PARALLEL REDUCTION
Composable, Robust and Efficient
    __device__ int reduce(thread_group g, int *x, int val) {
        int lane = g.thread_rank();
        for (int i = g.size()/2; i > 0; i /= 2) {
            x[lane] = val; g.sync();
            // Only the lower half accumulates, so x[lane + i] stays in bounds
            if (lane < i) val += x[lane + i];
            g.sync();
        }
        return val;
    }
Per-Block
    g = this_thread_block();
    reduce(g, ptr, myVal);
Per-Warp
    g = tiled_partition<32>(this_thread_block());
    reduce(g, ptr, myVal);
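The per-block case can be sketched as a complete program. This is a minimal illustration, not the presenter's code: the kernel name `sumKernel`, the shared-memory buffer `scratch`, the power-of-two block size, and the use of managed memory are all assumptions made to keep the example short.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Tree reduction over any thread group. x must provide g.size() ints of scratch.
// Assumes g.size() is a power of two, as the halving loop requires.
__device__ int reduce(cg::thread_group g, int *x, int val) {
    int lane = g.thread_rank();
    for (int i = g.size() / 2; i > 0; i /= 2) {
        x[lane] = val;
        g.sync();                          // all partial sums are visible
        if (lane < i) val += x[lane + i];  // lower half accumulates
        g.sync();                          // reads finish before the next write
    }
    return val;  // only the thread with rank 0 holds the full sum
}

// Hypothetical driver kernel: each block reduces its slice, then adds it to *out.
__global__ void sumKernel(const int *in, int *out, int n) {
    extern __shared__ int scratch[];
    cg::thread_block block = cg::this_thread_block();
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (idx < n) ? in[idx] : 0;
    int total = reduce(block, scratch, val);
    if (block.thread_rank() == 0) atomicAdd(out, total);
}

int main() {
    const int n = 1 << 20;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;

    const int threads = 256;  // power of two
    const int blocks = (n + threads - 1) / threads;
    sumKernel<<<blocks, threads, threads * sizeof(int)>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("sum = %d (CPU reference %d)\n", *out, n);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Because `reduce` takes a generic `thread_group`, the same function could in principle be called with a tile from `tiled_partition`, provided each tile is given its own region of scratch memory.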
LAUNCHING COOPERATIVE KERNELS
Three Synchronization Scales
Block or Sub-Block Sync: Launch with <<<>>> or cudaLaunchKernel()
Multi-Block Sync: Launch with cudaLaunchCooperativeKernel()
Multi-Device Sync: Launch with cudaLaunchCooperativeKernelMultiDevice()
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups
    // threads update particles in parallel
    integrate<<<blocks, threads, 0, stream>>>(particles);
[Figure: threads 0-7 mapped to particles]
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups
    // threads update particles in parallel
    integrate<<<blocks, threads, 0, s>>>(particles);
    // Collide each particle with others in neighborhood
    collide<<<blocks, threads, 0, s>>>(particles);
Note change in how threads map to particles in acceleration data structure
EXAMPLE: PARTICLE SIMULATION
Without Cooperative Groups
    // threads update particles in parallel
    integrate<<<blocks, threads, 0, s>>>(particles);
    // Note: implicit sync between kernel launches
    // Collide each particle with others in neighborhood
    collide<<<blocks, threads, 0, s>>>(particles);
Note change in how threads map to particles in acceleration data structure
[Figure: thread-to-particle mapping for the integrate step and for the collide step]
WHOLE-GRID COOPERATION
Particle Simulation Update in a Single Kernel
    __global__ void particleSim(Particle *p, int N) {
        grid_group g = this_grid();
        for (int i = g.thread_rank(); i < N; i += g.size())
            integrate(p[i]);
        g.sync(); // Sync whole grid!
        for (int i = g.thread_rank(); i < N; i += g.size())
            collide(p[i], p, N);
    }
Launch using cudaLaunchCooperativeKernel(…)
[Figure: thread-to-particle mapping before and after the grid sync]
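The host side of a whole-grid launch can be sketched as follows. The kernel `twoPhase` is a hypothetical stand-in for the integrate and collide steps, and sizing the grid from the occupancy calculator is one common way (an assumption here, not something the slides prescribe) to keep every block resident at once, which grid-wide synchronization requires. Depending on the toolkit version, building may also need relocatable device code (`-rdc=true`) and a Pascal-or-newer target architecture.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical two-phase kernel: phase 2 reads values written by other
// threads (possibly in other blocks) during phase 1.
__global__ void twoPhase(int *a, int *b, int n) {
    cg::grid_group grid = cg::this_grid();
    for (int i = grid.thread_rank(); i < n; i += grid.size())
        a[i] = i;                    // phase 1
    grid.sync();                     // whole grid finishes phase 1 first
    for (int i = grid.thread_rank(); i < n; i += grid.size())
        b[i] = a[(i + 1) % n];       // phase 2
}

int main() {
    int dev = 0, coop = 0;
    cudaSetDevice(dev);
    cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, dev);
    if (!coop) {
        printf("cooperative launch is not supported on device %d\n", dev);
        return 0;
    }

    int n = 1 << 20;
    int *a, *b;
    cudaMalloc(&a, n * sizeof(int));
    cudaMalloc(&b, n * sizeof(int));

    // All blocks must be co-resident, so cap the grid at what the GPU can hold.
    int threads = 256, blocksPerSm = 0, numSms = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, twoPhase, threads, 0);
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);
    int blocks = blocksPerSm * numSms;

    void *args[] = { &a, &b, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void *)twoPhase, dim3(blocks),
                                                  dim3(threads), args, 0, 0);
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int last = -1;
    cudaMemcpy(&last, b + (n - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("b[n-1] = %d\n", last);   // wraps around to a[0]
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

The grid-stride loops let a grid that is smaller than `n` still cover every element, which is why capping the block count does not change the result.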
MULTI-GPU COOPERATION
Large-scale Multi-GPU Simulation in a Single Kernel
    __global__ void particleSim(Particle *p, int N) {
        multi_grid_group g = this_multi_grid();
        for (int i = g.thread_rank(); i < N; i += g.size())
            integrate(p[i]);
        g.sync(); // Sync all GPUs!
        for (int i = g.thread_rank(); i < N; i += g.size())
            collide(p[i], p, N);
    }
Launch using cudaLaunchCooperativeKernelMultiDevice(…)
[Figure: thread-to-particle mapping on each of four GPUs, before and after the multi-GPU sync]
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model
Volta Independent Thread Scheduling:
Program familiar algorithms and data structures in a natural way
Flexible thread grouping and synchronization
Use explicit synchronization, don’t rely on implicit convergence
CUDA 9 provides a fully explicit synchronization model
ROBUST AND EXPLICIT WARP PROGRAMMING
Adapt Legacy Code for New Execution Model
Eliminate implicit warp synchronous programming on all architectures
Use explicit synchronization
Focus synchronization granularity with Cooperative Groups
Transition to new *_sync() primitives
__shfl_sync(), __ballot_sync(), __any_sync(), __all_sync(), __activemask()
CUDA 9 deprecates non-synchronizing __shfl(), __ballot(), __any(), __all()
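As an illustration of the transition, here is a warp-level sum written with `__shfl_down_sync()`. It is a sketch under stated assumptions: the helper names `warpReduceSum` and `warpSums` are made up for this example, and the full mask `0xffffffff` is only valid because the launch guarantees that all 32 lanes of every warp reach the shuffle.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum a value across the 32 lanes of a warp with explicit synchronization.
// Legacy code would have called __shfl_down(val, offset) and relied on
// implicit warp convergence.
__device__ int warpReduceSum(int val) {
    const unsigned fullMask = 0xffffffffu;  // all 32 lanes participate
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(fullMask, val, offset);
    return val;  // lane 0 holds the warp total
}

// Hypothetical kernel: writes one sum per warp. Expects a single block whose
// size is a multiple of 32, with one input element per thread.
__global__ void warpSums(const int *in, int *out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = warpReduceSum(in[idx]);
    if (threadIdx.x % 32 == 0) out[idx / 32] = sum;
}

int main() {
    const int n = 128;  // four full warps
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, (n / 32) * sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = i;

    warpSums<<<1, n>>>(in, out);
    cudaDeviceSynchronize();

    for (int w = 0; w < n / 32; ++w) {
        int expected = 0;
        for (int i = 0; i < 32; ++i) expected += in[w * 32 + i];
        printf("warp %d: GPU %d, CPU check %d\n", w, out[w], expected);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

When only some lanes are active, the mask must name exactly the participating lanes; a cooperative groups tile is often the simpler way to express that.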
Learn More
“Cooperative Groups: Flexible CUDA Thread Programming”
https://devblogs.nvidia.com/parallelforall/cooperative-groups/
GTC San Jose 2017: “Cooperative Groups”, Kyrylo Perelygin and Yuan Lin
http://on-demand-gtc.gputechconf.com/gtc-quicklink/pTT9h
DEVELOPER TOOLS
UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source
Page Fault Correlation
NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity
Page Throttling
Memory Thrashing
Remote Map
THE BEYOND SECTION
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc
Removes CUDA-specific allocator restrictions
Data movement is transparently handled
Requires operating system support: HMM Linux Kernel Module
CUDA 9 Code with System Allocator
    void sortfile(FILE *fp, int N) {
        char *data;
        // Allocate memory using any standard allocator
        data = (char *) malloc(N * sizeof(char));
        fread(data, 1, N, fp);
        sort<<<...>>>(data,N,1,compare);
        use_data(data);
        // Free the allocated memory
        free(data);
    }
Progress Update:
HMM patchset will be integrated into Linux Kernel 4.14
NVIDIA Driver support coming
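For comparison, the same single-pointer pattern can be written with the CUDA-specific managed allocator, which does not depend on HMM. This is a hedged sketch: the kernel `toUpper` is a hypothetical stand-in for the slide's `sort` kernel, and the explicit `cudaDeviceSynchronize()` is included because the CPU must not touch managed data while the kernel may still be running.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the slide's sort kernel:
// upper-cases ASCII lowercase letters in place.
__global__ void toUpper(char *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] >= 'a' && data[i] <= 'z') data[i] -= 32;
}

int main() {
    const int n = 16;
    char *data;

    // One pointer usable from both CPU and GPU, but obtained from the
    // CUDA allocator rather than malloc().
    cudaMallocManaged(&data, n);
    snprintf(data, n, "hello, cuda");

    toUpper<<<1, 32>>>(data, n);
    cudaDeviceSynchronize();  // finish GPU work before the CPU reads the data

    printf("%s\n", data);
    cudaFree(data);           // managed memory is released with cudaFree(), not free()
    return 0;
}
```

With a system allocator, the `cudaMallocManaged()` and `cudaFree()` calls would become plain `malloc()` and `free()`, as in the slide's example.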
USING TENSOR CORES
Volta Optimized Frameworks and Libraries
NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ Warp-Level Matrix Operations
    __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;
        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);
        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
    }
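A complete single-tile version might look like the sketch below. It uses the names from the released `mma.h` header, where the accumulator fragment is `wmma::accumulator` and the store layout is `wmma::mem_row_major`; the fragment template arguments, the kernel names `wmmaTile` and `fillHalf`, and the all-ones test data are assumptions chosen for this example, not taken from the slides. It needs a Volta-or-newer target (for example `-arch=sm_70`).

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B for a single 16x16x16 tile:
// FP16 inputs, FP32 accumulation.
__global__ void wmmaTile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);            // C starts at zero
    wmma::load_matrix_sync(aFrag, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // C = A * B + C
    wmma::store_matrix_sync(d, cFrag, 16, wmma::mem_row_major);
}

// Hypothetical helper: fills a half-precision buffer on the device.
__global__ void fillHalf(half *p, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(v);
}

int main() {
    const int elems = 16 * 16;
    half *a, *b;
    float *d;
    cudaMallocManaged(&a, elems * sizeof(half));
    cudaMallocManaged(&b, elems * sizeof(half));
    cudaMallocManaged(&d, elems * sizeof(float));

    fillHalf<<<1, elems>>>(a, 1.0f, elems);
    fillHalf<<<1, elems>>>(b, 1.0f, elems);
    wmmaTile<<<1, 32>>>(a, b, d);  // exactly one warp

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // With all-ones inputs every entry is a dot product of 16 ones.
    printf("d[0] = %.1f, d[255] = %.1f\n", d[0], d[elems - 1]);
    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    return 0;
}
```

All 32 threads of the warp must call each `*_sync` operation together; the fragments are opaque per-thread pieces of the tile, so individual elements should not be assumed to sit in any particular lane.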
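TENSOR CORE
Mixed Precision Matrix Math
4x4 matrices
$$
D =
\begin{pmatrix}
A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\
A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\
A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\
A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3}
\end{pmatrix}
\begin{pmatrix}
B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\
B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\
B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\
B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3}
\end{pmatrix}
+
\begin{pmatrix}
C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\
C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\
C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\
C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3}
\end{pmatrix}
$$
D = AB + C
Precision: D is FP16 or FP32, A is FP16, B is FP16, C is FP16 or FP32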
TENSOR CORE COORDINATION
Full Warp 16x16 Matrix Math
Warp-synchronizing operation for cooperative matrix math
Aggregate Matrix Multiply and Accumulate for 16x16 matrices
Result distributed across warp
CUDA TENSOR CORE PROGRAMMING
16x16x16 Warp Matrix Multiply and Accumulate (WMMA)
D = AB + C
CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes
wmma::fragment<matrix_a, …> Amat;
Per-Thread fragments to hold components of matrices for use with Tensor Cores
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations
    wmma::load_matrix_sync(Amat, a, stride);
Warp-level operation to fetch components of matrices into fragments
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation
    wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
Warp-level operation to perform matrix multiply and accumulate
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations
    wmma::store_matrix_sync(d, Dmat, stride);
Warp-level operation to store the result fragments back to memory
Learn More
Programming Tensor Cores in CUDA 9
https://devblogs.nvidia.com/parallelforall/programming-tensor-cores-cuda-9/
FUTURE COOPERATIVE GROUPS
Volta Enables Greater Flexibility
Partition using an arbitrary label:
    // Four groups of threads with same computed value
    int label = foo() % 4;
    thread_group block = partition(this_thread_block(), label);
Use with care: random groups can lead to SIMT execution inefficiency
FUTURE COOPERATIVE GROUPS
Library of Collective Algorithms
Reductions, sorting, prefix sum (scan), etc.
    // collective key-value sort using all threads in the block
    cooperative_groups::sort(this_thread_block(), myValues, myKeys);
    // collective scan-based allocate across block
    int sz = myAllocationSize(); // amount each thread wants
    int offset = cooperative_groups::exclusive_scan(this_thread_block(), sz);
Note: preliminary API sketch
#GTC17
CUDA 9 AND BEYOND
@harrism
http://parallelforall.com
http://developer.nvidia.com/cuda-toolkit