Programming Massively Parallel Processors Using CUDA & C++AMP
Lecture 1 - Introduction
Wen-mei Hwu and Izzat El Hajj
CEA-EDF-Inria Summer School 2013
Course Goals
• Learn how to program massively parallel processors and achieve
– high performance
– functionality and maintainability
– scalability across future generations
• Technical subjects
– principles and patterns of parallel algorithms
– processor architecture features and constraints
– programming API, tools and techniques
Text/Notes
• D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” 2nd Edition, Morgan Kaufmann Publishers, 2012
• NVIDIA, NVIDIA CUDA C Programming Guide, version 5.0, NVIDIA, 2013 (reference book)
Course Sessions Overview
• Monday: Lecture 1, Lecture 2, Lab Session 1
• Tuesday: Lecture 3, Lecture 4
• Wednesday: Lecture 5, Lecture 6, Lab Session 2
• Thursday: Lab Session 3
• Friday: Lab Session 4, Lab Session 5
Blue Waters Hardware
• Cray System & Storage cabinets: >300
• Compute nodes: >25,000
• Usable Storage Bandwidth: >1 TB/s
• System Memory: >1.5 Petabytes
• Memory per core module: 4 GB
• Gemini Interconnect Topology: 3D Torus
• Usable Storage: >25 Petabytes
• Peak performance: >11.5 Petaflops
• Number of AMD Interlagos processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA Kepler GPUs: >4,000
Cray XK7 Compute Node
[Figure: XK7 compute node block diagram — host CPU and GPU connected by HT3 links and PCIe Gen2, with the Gemini interconnect providing the X, Y, Z torus links]

XK7 Compute Node Characteristics:
• Host processor: AMD Series 6200 (Interlagos)
• Host memory: 32 GB, 1600 MT/s DDR3
• GPU: NVIDIA Kepler (NVIDIA Tesla X2090 memory: 6 GB GDDR5 capacity; Keplers in final installation)
• Interconnect: Gemini High Speed Interconnect
CPUs and GPUs have very different design philosophies

[Figure: CPU (latency-oriented cores) vs. GPU (throughput-oriented cores) — the CPU chip devotes its area to a few cores, each with a large local cache, registers, control logic, and a SIMD unit; the GPU chip packs many compute units, each with a small cache/local memory, registers, SIMD units, and heavy threading]
CPUs: Latency Oriented Design

• Large caches
– convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
– branch prediction for reduced branch latency
– data forwarding for reduced data latency
• Powerful ALUs
– reduced operation latency

[Figure: CPU block diagram — a few powerful ALUs, large control logic, a large cache, and DRAM]
GPUs: Throughput Oriented Design

• Small caches
– to boost memory throughput
• Simple control
– no branch prediction
– no data forwarding
• Energy-efficient ALUs
– many, long-latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies

[Figure: GPU block diagram — many small cores and DRAM]
Winning Applications Use Both CPU and GPU

• CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
Heterogeneous parallel computing is catching on.
• 280 submissions to GPU Computing Gems
– 110 articles included in two volumes

Application areas include: Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing Rendering, Interactive Physics, Numerical Methods
What is at stake?

• Scalable and portable software lasts through many hardware generations
• Scalable algorithms and libraries can be the best legacy we can leave behind from this era
Introduction to CUDA C
Objective
• To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism
– hierarchical thread organization
– main interfaces for launching parallel execution
– mapping of thread indices to data indices
Parallel Programming Work Flow
• Identify compute-intensive parts of an application
• Adopt scalable algorithms
• Optimize data arrangements to maximize locality
• Performance tuning
• Pay attention to code portability and maintainability
Massive Parallelism - Regularity
CUDA/OpenCL – Execution Model

• Integrated host+device application C program
– serial or modestly parallel parts in host C code
– highly parallel parts in device SPMD kernel C code

[Figure: application timeline alternating between serial host code and parallel device kernels:
Serial Code (host) … KernelA<<< nBlk, nTid >>>(args); … Serial Code (host) … KernelB<<< nBlk, nTid >>>(args);]
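A minimal, self-contained sketch of this model is shown below; the kernel name, data, and launch sizes are illustrative, and the memory API calls (cudaMalloc, cudaMemcpy, cudaFree) are covered later in this lecture:

#include <cstdio>

// Illustrative SPMD kernel: each thread doubles one element.
__global__ void KernelA(int* data) {
    data[threadIdx.x] *= 2;
}

int main() {
    int h[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int* d;
    cudaMalloc((void**)&d, sizeof(h));                   // device memory
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice); // host -> device
    KernelA<<<1, 8>>>(d);                                // 1 block of 8 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d);
    printf("h[3] = %d\n", h[3]);                         // prints 6
    return 0;
}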
The Von-Neumann Model

[Figure: Memory and I/O connected to a Processor, which contains a Control Unit (PC, IR) and a Processing Unit (ALU, Register File)]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– each thread is a virtualized Von-Neumann processor
– all threads in a grid run the same kernel code (SPMD)
– each thread has an index that it uses to compute memory addresses and make control decisions:

tid = compute thread index
work(tid)
Thread Blocks: Scalable Cooperation
• Divide the thread array into multiple blocks
– threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
– threads in different blocks cannot cooperate
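As a small illustrative sketch of intra-block cooperation (the function name and the 256-thread block size are assumptions, not from the slides), threads in one block can stage data in shared memory and meet at a barrier before reading each other's elements:

// Illustrative: intra-block cooperation via shared memory and a barrier.
// Assumes a single block of exactly 256 threads, e.g. <<<1, 256>>>.
__global__ void reverseInBlock(int* data) {
    __shared__ int tile[256];            // visible to all threads in the block
    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread stages one element
    __syncthreads();                     // barrier: all stores above are now visible
    data[t] = tile[blockDim.x - 1 - t];  // safely read another thread's element
}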
Computing the Thread Index

• gridDim.x – # of blocks in the grid
• blockIdx.x – position of the block in the grid
• blockDim.x – # of threads in each block
• threadIdx.x – position of the thread in its block

tid = blockIdx.x*blockDim.x + threadIdx.x

For example, with blockDim.x = 256, the thread with blockIdx.x = 2 and threadIdx.x = 5 computes tid = 2*256 + 5 = 517.
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– image processing
– solving PDEs on volumes
– …
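For instance, here is a hedged sketch of 2D index-to-pixel mapping for image processing (the names scalePixels, img, w, h and the 16x16 block shape are assumptions for illustration):

// Illustrative: map a 2D grid of threads onto image pixels.
__global__ void scalePixels(float* img, int w, int h, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (col < w && row < h)                            // skip threads past the edges
        img[row * w + col] *= s;
}

// Possible launch: 16x16 threads per block, enough blocks to cover the image
// dim3 DimBlock(16, 16, 1);
// dim3 DimGrid((w + 15) / 16, (h + 15) / 16, 1);
// scalePixels<<<DimGrid, DimBlock>>>(img_d, w, h, 2.0f);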
Vector Addition – Conceptual View

[Figure: vectors A (A[0] … A[N-1]) and B (B[0] … B[N-1]) are added element by element to produce vector C (C[0] … C[N-1]): C[i] = A[i] + B[i]]
Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    …
    vecAdd(A_h, B_h, C_h, N);
}
Heterogeneous Computing vecAdd Host Code

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    ...
    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory
    // 2. Kernel launch code – have the device
    //    perform the actual vector addition
    // 3. Copy C from device memory;
    //    free the device vectors
}

[Figure: CPU with host memory and GPU with device memory; Parts 1 and 3 are the transfers between the two memories, Part 2 runs on the GPU]
Control Flow

[Figure: control alternates between CPU and GPU:
cudaMalloc(...) → cudaMemcpy(...) host-to-device → kernel<<<...>>>(...) on the GPU → cudaMemcpy(...) device-to-host → cudaFree(...)]
Partial Overview of CUDA Memories
• Device code can:
– R/W per-thread registers
– R/W per-grid global memory
• Host code can:
– transfer data to/from per-grid global memory

[Figure: a device grid with Blocks (0,0) and (1,0), each containing Threads (0,0) and (1,0) with per-thread registers, all sharing global memory that the host can access]

We will cover more later.
CUDA Device Memory Management API Functions

• cudaMalloc()
– allocates an object in device global memory
– two parameters:
• address of a pointer to the allocated object
• size of the allocated object in bytes
• cudaFree()
– frees an object from device global memory
– one parameter: pointer to the freed object
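Both functions return a cudaError_t status code; the slides omit error handling for brevity, but a minimal sketch of checking it (the allocation size here is illustrative) looks like this:

#include <cstdio>
#include <cstdlib>

int main() {
    float* A_d;
    size_t size = 1024 * sizeof(float);            // illustrative size
    cudaError_t err = cudaMalloc((void**)&A_d, size);
    if (err != cudaSuccess) {
        // cudaGetErrorString converts the code to a readable message
        fprintf(stderr, "%s in %s at line %d\n",
                cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
    cudaFree(A_d);                                 // release the device allocation
    return 0;
}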
Host-Device Data Transfer API Functions

• cudaMemcpy()
– memory data transfer
– requires four parameters:
• pointer to destination
• pointer to source
• number of bytes copied
• type/direction of transfer
– transfer to device is asynchronous
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code – to be shown later
    ...

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, and C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
Example: Vector Addition Kernel

[Figure series: two input vectors mapped to one output vector, illustrating in turn:
• assign 1 thread per element
• quantization of the global thread count due to block organization
• boundary checks
• data padding]
Example: Vector Addition Kernel

Device Code:

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

Host Code:

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
More on Kernel Launch

Host Code:

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;   // add one block for the partial remainder
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
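A minimal sketch of such blocking, reusing the names above (cudaDeviceSynchronize is one standard way for the host to wait):

// Illustrative: the launch below returns to the host immediately.
vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
cudaDeviceSynchronize();   // block the host until the kernel finishes
// A blocking cudaMemcpy from the device would also wait for the kernel.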
Kernel Execution in a Nutshell

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__
void vecAdd(float *A_d, float *B_d, float *C_d, int n)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share the device RAM]
More on CUDA Function Declarations
                                 Executed on:   Only callable from:
__device__ float DeviceFunc()    device         device
__global__ void KernelFunc()     device         host
__host__   float HostFunc()      host           host

• __global__ defines a kernel function
– each “__” consists of two underscore characters
– a kernel function must return void
• __device__ and __host__ can be used together (see the sketch below)
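A brief illustrative sketch of that last point (the function names are assumptions): one definition compiles for both the CPU and the GPU:

// Illustrative: __host__ __device__ makes one definition usable on both sides.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void squareAll(float* A_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A_d[i] = square(A_d[i]);   // device-side call
}
// Host code may also call square(2.0f) directly.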
Compiling A CUDA Program

[Figure: integrated C programs with CUDA extensions enter the NVCC compiler, which splits them into host code, handled by the host C compiler/linker, and device code (PTX), handled by the device just-in-time compiler; both parts run on a heterogeneous computing platform with CPUs and GPUs]
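As a hedged example (the file and output names are illustrative), a single nvcc invocation drives this whole flow:

nvcc -o vecadd vecAdd.cu   (host code goes to the host compiler; device code is compiled toward PTX)
./vecadd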
ANY MORE QUESTIONS?
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013