The “New” Moore’s Law - UMIACSramani/cmsc828e_gpusci/Luebke_Maryland.… · MOTIVATION 146X...
© NVIDIA Corporation 2009
The “New” Moore’s Law
Computers no longer get faster, just wider
You must re-think your algorithms to be parallel!
Data-parallel computing is the most scalable solution
Enter the GPU
Massive economies of scale
Massively parallel
Enter CUDA
Scalable parallel programming model
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel computing
Sound Bite
GPUs + CUDA = The Democratization of Parallel Computing
Massively parallel computing has become a commodity technology
MOTIVATION
[Figure: Peak GFLOP/s over time, NVIDIA GPU vs. Intel CPU. Horizontal axis: Sep-02 to Feb-08. Vertical axis: 0 to 1000 GFLOP/s.]
MOTIVATION
146X: Interactive visualization of volumetric white matter connectivity
36X: Ionic placement for molecular dynamics simulation on GPU
19X: Transcoding HD video stream to H.264
17X: Fluid mechanics in Matlab using .mex file CUDA function
100X: Astrophysics N-body simulation
149X: Financial simulation of LIBOR model with swaptions
47X: GLAME@lab: an M-script API for GPU linear algebra
20X: Ultrasound medical imaging for cancer diagnostics
24X: Highly optimized object oriented molecular dynamics
30X: Cmatch exact string matching to find similar proteins and gene sequences
Motivation: NVIDIA
Supercomputing Performance
960 cores. 4 TeraFLOPS
250x the performance of a desktop
Personal
One researcher, one supercomputer
Plugs into standard power strip
Accessible
Program in C for Windows, Linux
Available now under $10,000
Accelerating Time to Insight
[Figure: run times for several applications, CPU Only vs. Heterogeneous with Tesla GPU. Values shown: 4.6 Days, 27 Minutes, 2.7 Days, 30 Minutes, 8 Hours, 13 Minutes, 16 Minutes, 3 Hours.]
Faster is not “just faster” - David Kirk, NVIDIA Chief Scientist
CUDA: ‘C’ FOR PARALLELISM
Standard C Code
    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);
Parallel C Code
    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Hierarchy of concurrent threads
Parallel kernels composed of many threads
all threads execute the same sequential program
Threads are grouped into thread blocks
threads in the same block can cooperate
Threads/blocks have unique IDs
Thread t
t0 t1 … tB
Block b
Kernel foo()
. . .
Hierarchical organization
Thread: per-thread local memory
Block: per-block shared memory, with a local barrier
Kernel 0, Kernel 1, . . .: per-device global memory, with a global barrier between kernels
Heterogeneous Programming
CUDA = serial program with parallel kernels, all in C
Serial C code executes in a CPU thread
Parallel kernel C code executes in thread blocks across multiple processing elements
Serial Code
Parallel Kernel: foo<<< nBlk, nTid >>>(args);
. . .
Serial Code
Parallel Kernel: bar<<< nBlk, nTid >>>(args);
. . .
Thread = virtualized scalar processor
Independent thread of execution
has its own PC, variables (registers), processor state, etc.
no implication about how threads are scheduled
Block = virtualized multiprocessor
Provides programmer flexibility
freely choose processors to fit data
freely customize for each kernel launch
Thread block = a (data) parallel task
all blocks in kernel have the same entry point
but may execute any code they want
Thread blocks of kernel must be independent tasks
program valid for any interleaving of block executions
Scalable Execution Model
Kernel launched by host
Blocks Run on Multiprocessors
[Figure: a row of multiprocessors, each with SPs, an MT IU, and Shared Memory, all attached to Device Memory.]
Synchronization & Cooperation
Threads within block may synchronize with barriers
    … Step 1 …
    __syncthreads();
    … Step 2 …
Blocks coordinate via atomic memory operations
    e.g., increment shared queue pointer with atomicInc()
Implicit barrier between dependent kernels
    vec_minus<<<nblocks, blksize>>>(a, b, c);
    vec_dot<<<nblocks, blksize>>>(c, c);
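The queue-pointer idea can be sketched as follows. This is a minimal illustration, not code from the talk: the kernel name `enqueuePositive`, the host wrapper `enqueuePositiveOnGpu`, and the "keep positive values" predicate are all hypothetical. Threads in any block claim a unique slot in a global queue by calling `atomicInc()` on a shared tail counter; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Each thread whose input is positive claims the next free queue slot.
// atomicInc returns the old value of *tail and increments it; with the limit
// 0xFFFFFFFF it behaves as a plain increment for any realistic queue size.
__global__ void enqueuePositive(const int *in, int n, int *queue, unsigned int *tail)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        unsigned int slot = atomicInc(tail, 0xFFFFFFFFu);
        queue[slot] = i;  // slots are unique, but their order is not deterministic
    }
}

// Hypothetical host wrapper: returns the indices of positive elements, sorted.
std::vector<int> enqueuePositiveOnGpu(const std::vector<int> &h_in)
{
    int n = (int)h_in.size();
    if (n == 0) return std::vector<int>();

    int *d_in, *d_queue;
    unsigned int *d_tail;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_queue, n * sizeof(int));
    cudaMalloc((void**)&d_tail, sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_tail, 0, sizeof(unsigned int));  // queue starts empty

    int nblocks = (n + 255) / 256;
    enqueuePositive<<<nblocks, 256>>>(d_in, n, d_queue, d_tail);

    // The copies below run after the kernel has finished (same default stream)
    unsigned int count = 0;
    cudaMemcpy(&count, d_tail, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    std::vector<int> result(count);
    if (count > 0)
        cudaMemcpy(result.data(), d_queue, count * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in); cudaFree(d_queue); cudaFree(d_tail);
    std::sort(result.begin(), result.end());  // undo the nondeterministic order
    return result;
}
```

The sort on the host is there only because the order in which threads win slots varies from run to run; the set of enqueued indices does not.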
Using per-block shared memory
Variables shared across block
    __shared__ int *begin, *end;
Scratchpad memory
    __shared__ int scratch[blocksize];
    scratch[threadIdx.x] = begin[threadIdx.x];
    // … compute on scratch values …
    begin[threadIdx.x] = scratch[threadIdx.x];
Communicating values between threads
    scratch[threadIdx.x] = begin[threadIdx.x];
    __syncthreads();
    int left = scratch[threadIdx.x - 1];  // valid only for threadIdx.x > 0
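A complete kernel built on this communication pattern might look like the sketch below, which computes adjacent differences `out[i] = in[i] - in[i-1]`. The names `adjacentDifference`, `adjacentDifferenceOnGpu`, and `BLOCK_SIZE` are hypothetical, and the handling of the first thread in each block (which has no left neighbor in shared memory) is an assumption added for the example. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256

// out[i] = in[i] - in[i-1], with out[0] = in[0].
// Each block stages its elements in shared memory so that a thread can read
// the value loaded by its left neighbor.
__global__ void adjacentDifference(const int *in, int *out, int n)
{
    __shared__ int scratch[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) scratch[threadIdx.x] = in[i];
    __syncthreads();  // reached by every thread, including those with i >= n

    if (i < n) {
        int left;
        if (threadIdx.x > 0)
            left = scratch[threadIdx.x - 1];  // neighbor is in the same block
        else if (i > 0)
            left = in[i - 1];                 // neighbor belongs to the previous block
        else
            left = 0;                         // first element has no neighbor
        out[i] = scratch[threadIdx.x] - left;
    }
}

// Hypothetical host wrapper around the kernel.
std::vector<int> adjacentDifferenceOnGpu(const std::vector<int> &h_in)
{
    int n = (int)h_in.size();
    std::vector<int> h_out(n);
    if (n == 0) return h_out;

    int *d_in, *d_out;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    int nblocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    adjacentDifference<<<nblocks, BLOCK_SIZE>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);
    return h_out;
}
```

Note that `__syncthreads()` sits outside the `if (i < n)` branches: a barrier that only some threads of a block reach is undefined behavior.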
Summing Up
CUDA = C + a few simple extensions
makes it easy to start writing basic parallel programs
Three key abstractions:
1. hierarchy of parallel threads
2. corresponding levels of synchronization
3. corresponding memory spaces
Supports massive parallelism of manycore GPUs
SOME FINAL THOUGHTS
We should teach parallel computing in CS 1 or CS 2
Remember: computers don’t get faster, just wider
Heapsort and mergesort
Both O(n lg n)
One parallel-friendly, one not
Students need to understand this early
Conclusion
GPUs are massively parallel manycore computers
Ubiquitous - most successful parallel processor in history
Useful - users achieve huge speedups on real problems
CUDA is a powerful parallel architecture and programming model
Heterogeneous - mixed serial-parallel programming
Scalable - hierarchical thread execution model
Accessible – e.g. minimal but expressive changes to C
They provide tremendous scope for innovative research
Questions?
Example: Vector Add w/ Host Code
© NVIDIA Corporation 2008
Example: Vector Addition Kernel
Device Code
    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__ void vecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        C[i] = A[i] + B[i];
    }
Host Code
    int main()
    {
        // Run N/256 blocks of 256 threads each (assumes N is a multiple of 256)
        vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
    }
Example: Host code for vecAdd
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …;
    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc( (void**) &d_A, N * sizeof(float));
    cudaMalloc( (void**) &d_B, N * sizeof(float));
    cudaMalloc( (void**) &d_C, N * sizeof(float));
    // copy host memory to device
    cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice );
    cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice );
    // execute the kernel on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
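The slide stops at the kernel launch. A fuller round trip, including copying the result back and releasing device memory, could look like the sketch below. It departs from the slide in two labeled ways: the kernel (renamed `vecAddChecked` here) takes an extra length parameter `n` with a bounds check, and the grid size is rounded up, so that N need not be a multiple of 256. The wrapper `vecAddOnGpu` is a hypothetical name, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Bounds-checked variant of the slide's kernel. The parameter n is an
// addition so that the vector length need not be a multiple of the block size.
__global__ void vecAddChecked(const float *A, const float *B, float *C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Hypothetical host wrapper: returns a + b computed on the GPU.
// Assumes a and b have the same length.
std::vector<float> vecAddOnGpu(const std::vector<float> &a, const std::vector<float> &b)
{
    int n = (int)a.size();
    std::vector<float> c(n);
    if (n == 0) return c;
    size_t bytes = n * sizeof(float);

    // allocate device (GPU) memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // copy host memory to device
    cudaMemcpy(d_A, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, b.data(), bytes, cudaMemcpyHostToDevice);

    // round the block count up so the last partial block is covered
    int nblocks = (n + 255) / 256;
    vecAddChecked<<<nblocks, 256>>>(d_A, d_B, d_C, n);

    // copy the result back; this copy waits for the kernel to finish
    cudaMemcpy(c.data(), d_C, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return c;
}
```

A call such as `vecAddOnGpu({1.0f, 2.0f, 3.0f}, {10.0f, 20.0f, 30.0f})` would launch a single block in which only the first three threads do any work.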
Example: Reduction
Example: Parallel Reduction
Summing up a sequence with 1 thread:
    int sum = 0;
    for(int i=0; i<N; ++i) sum += x[i];
Parallel reduction builds a summation tree
each thread holds 1 element
stepwise partial sums
N threads need log N steps
one possible approach: Butterfly pattern
Parallel Reduction for 1 Block
    // INPUT: Thread i holds value x_i
    int i = threadIdx.x;
    __shared__ int sum[blocksize];
    // One thread per element
    sum[i] = x_i; __syncthreads();
    for(int bit=blocksize/2; bit>0; bit/=2)
    {
        int t=sum[i]+sum[i^bit]; __syncthreads();
        sum[i]=t; __syncthreads();
    }
    // OUTPUT: Every thread now holds sum in sum[i]
Reduction tree redux
Input (shared memory):
    10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Step 1, active threads 0 to 7: x[i] += x[i+8];
    8 -2 10 6 0 9 3 7 -2 -3 2 7 0 11 0 2
Step 2, active threads 0 to 3: x[i] += x[i+4];
    8 7 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2
Step 3, active threads 0 and 1: x[i] += x[i+2];
    21 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2
Step 4, active thread 0: x[i] += x[i+1];
    41 20 13 13 0 9 3 7 -2 -3 2 7 0 11 0 2
Final result: 41 (in x[0])
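The tree above can be written as a loop over a shrinking stride. The sketch below is one possible rendering, not the code behind the slide: `blockSum`, `sumOnGpu`, and the choice of `BLOCK_SIZE` 16 (matching the 16-element example) are assumptions. Each block reduces 16 elements in shared memory, and the host adds up the per-block results. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 16  // must be a power of two; 16 matches the example above

// Each block sums BLOCK_SIZE consecutive elements with sequential addressing:
// at stride s, thread i (for i < s) performs x[i] += x[i+s].
__global__ void blockSum(const int *in, int *out)
{
    __shared__ int x[BLOCK_SIZE];
    int i = threadIdx.x;
    x[i] = in[blockIdx.x * BLOCK_SIZE + i];
    __syncthreads();

    for (int s = BLOCK_SIZE / 2; s > 0; s /= 2) {
        if (i < s) x[i] += x[i + s];
        __syncthreads();  // finish this step before the next one reads x
    }
    if (i == 0) out[blockIdx.x] = x[0];
}

// Hypothetical host wrapper: pads the input with zeros to a whole number of
// blocks, reduces each block on the GPU, and adds the partial sums on the CPU.
int sumOnGpu(std::vector<int> v)
{
    if (v.empty()) return 0;
    size_t nblocks = (v.size() + BLOCK_SIZE - 1) / BLOCK_SIZE;
    v.resize(nblocks * BLOCK_SIZE, 0);

    int *d_in, *d_out;
    cudaMalloc((void**)&d_in, v.size() * sizeof(int));
    cudaMalloc((void**)&d_out, nblocks * sizeof(int));
    cudaMemcpy(d_in, v.data(), v.size() * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<(unsigned int)nblocks, BLOCK_SIZE>>>(d_in, d_out);

    std::vector<int> partial(nblocks);
    cudaMemcpy(partial.data(), d_out, nblocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out);

    int total = 0;
    for (size_t b = 0; b < nblocks; ++b) total += partial[b];
    return total;
}
```

Within one step, the active threads read from the upper half of the live range and write to the lower half, so no thread reads a location that another thread is writing; the barrier only has to separate consecutive steps.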
Compare to interleaved addressing:
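Sequential-addressing steps from the previous slide, for comparison:
    x[i] += x[i+8];
    x[i] += x[i+4];
    x[i] += x[i+2];
    x[i] += x[i+1];
Input (shared memory):
    10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2
Step 1, active threads 0 to 7 (partner at stride 1):
    11 1 7 -1 -2 -2 8 5 -5 -3 9 7 11 11 2 2
Step 2, active threads 0 to 3 (partner at stride 2):
    18 1 7 -1 6 -2 8 5 4 -3 9 7 13 11 2 2
Step 3, active threads 0 and 1 (partner at stride 4):
    24 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2
Step 4, active thread 0 (partner at stride 8):
    41 1 7 -1 6 -2 8 5 17 -3 9 7 13 11 2 2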
OpenCL
© NVIDIA Corporation 2007
CUDA: An Architecture for Massively Parallel Computing
ATI’s Compute “Solution”
OpenCL vs. C for CUDA
Shared back-end compiler & optimization technology
C for CUDA: entry point for developers who prefer high-level C
OpenCL: entry point for developers who want low-level API
Both compile to PTX, which runs on the GPU
FFT Kernel Example
OPENCL
    __kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                              __local float *sMemx, __local float *sMemy)
    {
        int tid = get_local_id(0);
        int blockIdx = get_group_id(0) * 1024 + tid;
        float2 data[16];
        in = in + blockIdx; out = out + blockIdx;
        globalLoads(data, in, 64);               // coalesced global reads
        fftRadix16Pass(data);                    // in-place radix-16 pass
        twiddleFactorMul(data, tid, 1024, 0);
        localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
        fftRadix16Pass(data);                    // in-place radix-16 pass
        twiddleFactorMul(data, tid, 64, 4);      // twiddle factor multiplication
        localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));
        fftRadix4Pass(data);                     // four radix-4 function calls
        fftRadix4Pass(data + 4);
        fftRadix4Pass(data + 8);
        fftRadix4Pass(data + 12);
        globalStores(data, out, 64);             // coalesced global writes
    }
C for CUDA (Written by Vasily Volkov, © UC
    __global__ void FFT1024_device( float2 *dst, float2 *src )
    {
        int tid = threadIdx.x;
        int iblock = blockIdx.y * gridDim.x + blockIdx.x;
        int index = iblock * 1024 + tid;
        src += index; dst += index;
        int hi4 = tid>>4; int lo4 = tid&15;
        int hi2 = tid>>4; int mi2 = (tid>>2)&3; int lo2 = tid&3;
        float2 a[16];
        __shared__ float smem[69*16];
        load<16>( a, src, 64 );
        FFT16( a );
        twiddle<16>( a, tid, 1024 );
        int il[] = {0,1,2,3, 16,17,18,19, 32,33,34,35, 48,49,50,51};
        transpose<16>( a, &smem[lo4*65+hi4], 4, &smem[lo4*65+hi4*4], il );
        FFT4x4( a );
        twiddle4x4( a, lo4 );
        transpose4x4( a, &smem[hi2*17 + mi2*4 + lo2], 69,
                      &smem[mi2*69*4 + hi2*69 + lo2*17 ], 1, 0xE );
        FFT16( a );
        store<16>( a, dst, 64 );
    }
Slide callouts: Calculate Index, Load Data, FFT Kernel
Different Host Code Styles
Calling a C function in nvcc
extern "C" void FFT1024( float2 *work, int batch )
{
FFT1024_device<<< grid2D(batch), 64 >>>( work, work );
}
OpenCL API-style programming
    // create a compute context with GPU device
    context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);
    // create a work-queue
    queue = clCreateWorkQueue(context, NULL, NULL, 0);
    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float)*2*num_entries, srcA);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(float)*2*num_entries, NULL);
    // create the compute program
    program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);
    // build the compute program executable
    clBuildProgramExecutable(program, false, NULL, NULL);
    // create the compute kernel
    kernel = clCreateKernel(program, "fft1D_1024");
    // create N-D range object with work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 64;
    range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);
    // set the args values
    clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
    clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
    clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
    clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
    // execute kernel
    clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
Source: SIGGraph sneak preview, A Munshi, Apple Computer
NVIDIA’s PTX layer manages kernel resources and execution
Sparse Linear Algebra Results
Sparse Matrix-Vector Multiplication (SpMV) on CUDA
Experimented with several data structures
CSR: Compressed Sparse Row
HYB: Hybrid of ELLPACK (ELL) and Coordinate (COO) formats
HYB gave best results
Speed of ELL with flexibility of COO
Benchmarked against matrices from
“Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms”, S. Williams et al, Supercomputing 2007
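For readers unfamiliar with the CSR layout named above, a minimal SpMV kernel over it is sketched below, with one thread per matrix row. This is the simplest textbook formulation and is not claimed to be the implementation that was benchmarked; the HYB kernel that gave the best results is not shown. The names `spmvCsrScalar`, `spmvCsrOnGpu`, `row_ptr`, `col_idx`, and `vals` are assumptions for the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// y = A*x for a matrix A in CSR form, one thread per row.
// row_ptr has num_rows+1 entries; the nonzeros of row r are
// vals[row_ptr[r] .. row_ptr[r+1]-1], with column indices in col_idx.
__global__ void spmvCsrScalar(int num_rows, const int *row_ptr, const int *col_idx,
                              const float *vals, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            dot += vals[j] * x[col_idx[j]];
        y[row] = dot;
    }
}

// Copies a host vector to freshly allocated device memory (hypothetical helper).
template <typename T>
static T *toDevice(const std::vector<T> &h)
{
    T *d = 0;
    cudaMalloc((void**)&d, h.size() * sizeof(T));
    cudaMemcpy(d, h.data(), h.size() * sizeof(T), cudaMemcpyHostToDevice);
    return d;
}

// Hypothetical host wrapper. Assumes a matrix with at least one row and one nonzero.
std::vector<float> spmvCsrOnGpu(const std::vector<int> &row_ptr,
                                const std::vector<int> &col_idx,
                                const std::vector<float> &vals,
                                const std::vector<float> &x)
{
    int num_rows = (int)row_ptr.size() - 1;
    std::vector<float> y(num_rows);

    int *d_row_ptr = toDevice(row_ptr);
    int *d_col_idx = toDevice(col_idx);
    float *d_vals = toDevice(vals);
    float *d_x = toDevice(x);
    float *d_y = 0;
    cudaMalloc((void**)&d_y, num_rows * sizeof(float));

    int nblocks = (num_rows + 255) / 256;
    spmvCsrScalar<<<nblocks, 256>>>(num_rows, d_row_ptr, d_col_idx, d_vals, d_x, d_y);

    cudaMemcpy(y.data(), d_y, num_rows * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_row_ptr); cudaFree(d_col_idx); cudaFree(d_vals);
    cudaFree(d_x); cudaFree(d_y);
    return y;
}
```

With one thread per row, neighboring threads walk through different rows and therefore touch non-adjacent entries of `vals` and `col_idx`, and rows of very different lengths leave some threads idle. Both effects are commonly cited reasons for trying other layouts such as ELL or a hybrid.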
Results: Sparse Matrix-Vector Multiplication (SpMV) on CUDA
CPU Results from “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007
Double Precision
T10 Double Precision Floating Point
| Feature | T10 |
| --- | --- |
| Precision | IEEE 754 |
| Rounding modes for FADD and FMUL | All 4 IEEE: round to nearest, zero, inf, -inf |
| Denormal handling | Full speed |
| NaN support | Yes |
| Overflow and Infinity support | Yes |
| Flags | No |
| FMA | Yes |
| Square root | Software with low-latency FMA-based convergence |
| Division | Software with low-latency FMA-based convergence |
| Reciprocal estimate accuracy | 24 bit |
| Reciprocal sqrt estimate accuracy | 23 bit |
| log2(x) and 2^x estimates accuracy | 23 bit |
Double Precision Floating Point
Single Precision Floating Point
| Feature | G80 | SSE | IBM Altivec | Cell SPE |
| --- | --- | --- | --- | --- |
| Precision | IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754 |
| Rounding modes for FADD and FMUL | Round to nearest and round to zero | All 4 IEEE: round to nearest, zero, inf, -inf | Round to nearest only | Round to zero/truncate only |
| Denormal handling | Flush to zero | Supported, 1000’s of cycles | Supported, 1000’s of cycles | Flush to zero |
| NaN support | Yes | Yes | Yes | No |
| Overflow and Infinity support | Yes, only clamps to max norm | Yes | Yes | No, infinity |
| Flags | No | Yes | Yes | Some |
| Square root | Software only | Hardware | Software only | Software only |
| Division | Software only | Hardware | Software only | Software only |
| Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit | 12 bit |
| Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit | 12 bit |
| log2(x) and 2^x estimates accuracy | 23 bit | No | 12 bit | No |
Products
Tesla S1070 1U System
4 Teraflops (single precision)
800 watts (typical power)
Tesla C1060 Board
957 Gigaflops (single precision)
160 Watts (typical power)
Building a 100TF datacenter
CPU 1U Server: 4 CPU cores, 0.07 Teraflop, $ 2000, 400 W
Tesla 1U System: 4 GPUs (960 cores), 4 Teraflops, $ 8000, 800 W
CPU-only datacenter: 1429 CPU servers, $ 3.1 M, 571 KW
Datacenter with Tesla: 25 CPU servers plus 25 Tesla systems, $ 0.31 M, 27 KW
10x lower cost
21x lower power
Tesla Personal Supercomputer
Supercomputing Performance
Massively parallel CUDA Architecture
960 cores. 4 TeraFlops
250x the performance of a desktop
Personal
One researcher, one supercomputer
Plugs into standard power strip
Accessible
Program in C for Windows, Linux
Available now worldwide under $10,000
C-for-CUDA SDK
NVIDIA C Compiler
NVIDIA Assembly for Computing
CPU Host Code
Integrated CPU and GPU C Source Code
Libraries: FFT, BLAS, CuDPP…
Example Source Code
CUDA Driver
Debugger
Profiler
Standard C Compiler
GPU
CPU
Quotes
GPUs have evolved to the point where many real world applications are easily implemented on them and run significantly faster than on multi-core systems.
Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.
Jack Dongarra
Professor, University of Tennessee
Author of Linpack
We’ve all heard ‘desktop supercomputer’ claims in the past, but this time it’s for real: NVIDIA and its partners will be delivering outstanding performance and broad applicability to the mainstream marketplace.
Heterogeneous computing is what makes such a breakthrough possible.
Burton Smith
Technical Fellow, Microsoft
Formerly, Chief Scientist at Cray