GPU Programming
Lecture 2: CUDA C Basics
Miaoqing Huang, University of Arkansas
Spring 2015
Outline
Evolution of NVIDIA GPUs
CUDA Basics
Detailed Steps
  Device Memories and Data Transfer
  Kernel Functions and Threading
Architecture of G80 GPU
- 128 streaming processors in 16 streaming multiprocessors
Architecture of GT200 GPU (e.g., GeForce GTX 295)
- 240 streaming processors in 30 streaming multiprocessors
Architecture of Fermi GPU (e.g., GeForce GTX 480, Tesla C2075)
[Figure: Fermi streaming multiprocessor (SM) - instruction cache, dual warp schedulers with dual dispatch units, a 32,768 x 32-bit register file, 32 CUDA cores, 16 load/store (LD/ST) units, 4 special function units, an interconnect network, and 64 KB of shared memory / L1 cache. Each CUDA core pairs an FP unit and an INT unit behind a dispatch port, operand collector, and result queue. The full 16-SM Fermi GPU adds a shared L2 cache, the DRAM interfaces, the GigaThread engine, and the host interface.]
- 512 Fermi streaming processors in 16 streaming multiprocessors
Architecture of Kepler GPU (e.g., Tesla K20)
- 2,880 streaming processors in 15 streaming multiprocessors
Architecture of Kepler Streaming Multiprocessor
- Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores
Comparison among Three Architectures

                              G80            GT200          Fermi
Transistors                   681 million    1.4 billion    3.0 billion
CUDA Cores                    128            240            512
Double Precision Capability   None           30 FMAs/clk    256 FMAs/clk
Single Precision Capability   128 MADs/clk   240 MADs/clk   512 FMAs/clk
Special Function Units / SM   2              2              4
Shared Memory / SM            16 KB          16 KB          48 KB / 16 KB
L1 Cache / SM                 None           None           16 KB / 48 KB
L2 Cache                      None           None           768 KB
ECC Support                   No             No             Yes
Multiply-Add (MAD): Result = (A x B) + C, where the product A x B is first rounded (extra digits truncated) before the addition, which is rounded again.
Fused Multiply-Add (FMA): Result = (A x B) + C, where all digits of the intermediate product are retained and only a single rounding is applied to the final result.
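To see the difference, a small worked example (the values and the three-significant-digit decimal format are assumed here for illustration, not taken from the slides): let A = 1.23, B = 4.56, C = -5.60.

    A x B = 5.6088                                   (exact product)
    MAD:  round(5.6088) = 5.61;  5.61 + (-5.60)   = 0.0100
    FMA:  5.6088 + (-5.60) = 0.0088, rounded once = 0.00880

The exact result is 0.0088, so the single rounding of FMA preserves accuracy that MAD loses to the intermediate truncation.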
Fermi vs. Kepler: Compute Capability of Fermi and Kepler GPUs

                                       FERMI        FERMI        KEPLER       KEPLER
                                       GF100        GF104        GK104        GK110
Compute Capability                     2.0          2.1          3.0          3.5
Threads / Warp                         32           32           32           32
Max Warps / Multiprocessor             48           48           64           64
Max Threads / Multiprocessor           1536         1536         2048         2048
Max Thread Blocks / Multiprocessor     8            8            16           16
32-bit Registers / Multiprocessor      32768        32768        65536        65536
Max Registers / Thread                 63           63           63           255
Max Threads / Thread Block             1024         1024         1024         1024
Shared Memory Configurations (bytes)   16K/48K      16K/48K      16K/32K/48K  16K/32K/48K
Max X Grid Dimension                   2^16 - 1     2^16 - 1     2^32 - 1     2^32 - 1
Hyper-Q                                No           No           No           Yes
Dynamic Parallelism                    No           No           No           Yes
Compute Capability
- The compute capability of a device is defined by a major revision number and a minor revision number
- Devices with the same major revision number are of the same core architecture
  - Kepler architecture: 3.x
  - Fermi architecture: 2.x
  - Prior devices: 1.x
- "NVIDIA CUDA C Programming Guide (v6.5)"
  - Appendix A lists all CUDA-enabled devices along with their compute capability
  - Appendix G gives the technical specifications of each compute capability
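In practice, the compute capability of the installed device can be queried from the CUDA runtime; a minimal sketch (standard cudaGetDeviceProperties call, with output formatting of our own choosing):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // major/minor are the two revision numbers, e.g., 3.5 for a GK110-based device
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}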
Outline
Evolution of NVIDIA GPUs
CUDA Basics
Detailed Steps
  Device Memories and Data Transfer
  Kernel Functions and Threading
Data Parallelism: Square Matrix Multiplication Example

[Figure: matrices M, N, and P, each of size WIDTH x WIDTH]

- P = M x N, where each matrix is of size WIDTH x WIDTH
- The approach in C

for (i = 0; i < WIDTH; i++) {
    for (j = 0; j < WIDTH; j++) {
        ......
        P[i][j] = ......;
        ......
    }
}

- The approach in CUDA (the straightforward way)
  - One thread calculates one element of P
  - Issue WIDTH x WIDTH threads simultaneously
The Heterogeneous Platform
- A typical GPGPU-capable platform consists of
  - One or more microprocessors (CPUs) - the host
  - One or more GPUs - the device
- GPU devices are connected to the CPUs through a PCIe 2.0 bus
- A CUDA program consists of
  - The code running on the CPU - the host part
  - The code running on the GPU - the device part
[Figure: example heterogeneous platform - CPUs linked by QPI (25.6 GB/s), GPUs attached over 16x PCIe 2.0 (5.8 GB/s unidirectional CPU-GPU), with on-board GPU memory bandwidths of 102 GB/s and 144 GB/s]
CUDA Program Structure
Serial Code (host)
    . . .
Parallel Kernel (device):  KernelA<<< nBlk, nTid >>>(args);    // executes as Grid 0
Serial Code (host)
    . . .
Parallel Kernel (device):  KernelB<<< nBlk, nTid >>>(args);    // executes as Grid 1
- Integrated host + device application C program
  - Sequential or modestly parallel parts in host C code
  - Highly parallel parts in device SPMD kernel C code
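A minimal sketch of this structure in code (the kernel names, launch sizes, and synchronization calls are our own illustration, not from the slides):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for the highly parallel phases
__global__ void KernelA(float* data) { /* ... parallel work, executes as Grid 0 ... */ }
__global__ void KernelB(float* data) { /* ... parallel work, executes as Grid 1 ... */ }

int main(void) {
    float* d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));   // assumed problem size

    /* ... serial host code: read and prepare input ... */

    KernelA<<<4, 256>>>(d_data);      // nBlk = 4 blocks, nTid = 256 threads per block
    cudaDeviceSynchronize();          // wait for Grid 0 to finish

    /* ... serial host code between the two kernels ... */

    KernelB<<<4, 256>>>(d_data);      // launches Grid 1
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}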
CUDA Thread Organization
[Figure: the host launches Kernel 1 as Grid 1 (blocks (0,0), (1,0), (0,1), (1,1)) and Kernel 2 as Grid 2 on the device; an expanded Block (1, 1) shows its threads labeled Thread (0,0,0) through Thread (3,1,0). Courtesy: NVIDIA]
- A kernel is implemented as a grid of threads
- Threads in a grid are further decomposed into blocks
  - A grid can be up to three-dimensional
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
  - Efficiently sharing data through shared memory
- A block can be one-, two-, or three-dimensional
- Each block and each thread has its own ID
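These IDs are what device code uses to decide which data each thread works on; a small sketch (our own example, not from the slides) combining the built-in variables into a unique global index for a one-dimensional launch:

// Each thread derives a unique global index from its block ID, the block size,
// and its thread ID within the block.
__global__ void WhoAmI(int* out, int n) {
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    if (global_id < n)          // the last block may extend past the data
        out[global_id] = global_id;
}

Launched as WhoAmI<<<4, 256>>>(out, n), this creates a grid of 4 blocks x 256 threads = 1024 threads, each writing its own index.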
Matrix Multiplication: the main() function
int main(void) {
    // 1. Allocate and initialize the matrices M, N, P
    //    I/O to read the input matrices M and N
    ......

    // 2. M * N on the processor
    MatrixMultiplication(M, N, P, width);

    // 3. I/O to write the output matrix P
    //    Free matrices M, N, P
    ......
    return 0;
}
Matrix Multiplication: A Simple Host Version in C
void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; i++) {
        for (int j = 0; j < width; j++) {
            float sum = 0;
            for (int k = 0; k < width; k++) {
                float a = M[i*width + k];
                float b = N[k*width + j];
                sum += a * b;
            }
            P[i*width + j] = sum;
        }
    }
}
[Figure: row-major layout - the 4 x 4 matrix with elements M0,0 through M3,3 is stored as a one-dimensional array row after row, so element Mi,j (row i, column j) sits at offset i*WIDTH + j]
Matrix Multiplication: Move computation to the device
void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;
    ......
    // 1. Allocate device memory for M, N, and P
    //    Copy M and N to the allocated device memory locations

    // 2. Kernel invocation code - to have the device perform
    //    the actual matrix multiplication

    // 3. Copy P from the device memory
    //    Free device matrices
}
Outline
Evolution of NVIDIA GPUs
CUDA Basics
Detailed Steps
  Device Memories and Data Transfer
  Kernel Functions and Threading
Memory Hierarchy on Device

- Memory hierarchy on the device
  - Global Memory
    - Main means of communicating between host and device
    - Long latency access
  - Shared Memory
    - Short latency
  - Registers
    - Per-thread local variables

[Figure: CUDA device memory model - the host exchanges data with per-grid global memory; each block has its own shared memory; each thread has its own registers]

- Data access
  - Device code can read/write per-thread registers, per-block shared memory, and per-grid global memory
  - Host code can transfer data to/from per-grid global memory
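A short sketch (our own illustration, not from the slides) of where each of these memory spaces shows up in device code:

#define TILE 256    // assumed block size for this sketch

__global__ void ScaleVector(float* g_data, int n) {
    __shared__ float tile[TILE];                        // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;      // 'i' lives in per-thread registers

    if (i < n)
        tile[threadIdx.x] = g_data[i];                  // read from per-grid global memory
    __syncthreads();                                    // every thread in the block reaches this point

    if (i < n)
        g_data[i] = 2.0f * tile[threadIdx.x];           // write the result back to global memory
}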
CUDA Device Memory Allocation

- cudaMalloc()
  - Allocates an object in the device Global Memory
  - Requires two parameters
    1. Address of a pointer to the allocated object
    2. Size of the allocated object

float* Md;    // d indicates a device data
int size = width*width*sizeof(float);

cudaMalloc((void**)&Md, size);
cudaFree(Md);
- cudaFree()
  - Frees an object from the device Global Memory
  - Requires one parameter: a pointer to the freed object
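Both calls return a cudaError_t status; a hedged sketch of the same allocation with error checking added (the checking style is ours, not part of the slides; assumes <stdio.h>, <stdlib.h>, and <cuda_runtime.h> are included):

float* Md = NULL;                          // d indicates a device data
int size = width * width * sizeof(float);

cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess) {
    // cudaGetErrorString turns the status code into a readable message
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(EXIT_FAILURE);
}

/* ... use Md on the device ... */

cudaFree(Md);                              // release the device allocation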
CUDA Host-Device Data Transfer

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters
    1. Pointer to destination
    2. Pointer to source
    3. Number of bytes copied
    4. Type of transfer
  - Types of transfer
    - cudaMemcpyHostToHost
    - cudaMemcpyHostToDevice
    - cudaMemcpyDeviceToHost
    - cudaMemcpyDeviceToDevice
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
Matrix Multiplication: After Integrating the Data Transfer
void MatrixMultiplication(float* M, float* N, float* P, int width)
{
    int size = width*width*sizeof(float);
    float *Md, *Nd, *Pd;

    // 1. Allocate and load M, N to device memory
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate P on the device
    cudaMalloc((void**)&Pd, size);

    // 2. Kernel invocation code -- to be shown later

    // 3. Read P from the device
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

    // Free device matrices
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
Index (i.e., coordinates) of Block and Thread
[Figure: a block of threads arranged along the X, Y, and Z axes from origin O, each labeled with its (tx, ty, tz) coordinates, e.g., Thread (0,0,0) through Thread (3,1,0); alongside, a 4 x 4 matrix M whose element Mi,j is handled by one thread]
- Each block and each thread is assigned an index, i.e., blockIdx and threadIdx
  - blockIdx.x, blockIdx.y, blockIdx.z
  - threadIdx.x, threadIdx.y, threadIdx.z
- threadIdx.y --> row, threadIdx.x --> column
Define the Dimension of Grid and Block
- Use the pre-defined type dim3 to define the dimensions of the grid and the block
- Use gridDim and blockDim to get the dimensions of the grid and the block

dim3 dimGrid(width_g, height_g, depth_g);
dim3 dimBlock(width_b, height_b, depth_b);

- Total number of threads issued
  - width_g x height_g x depth_g x width_b x height_b x depth_b
- Launch the device computation threads
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
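For a concrete feel (example numbers of our own, not from the slides): with WIDTH = 1000 and a 16 x 16 block, the grid needs ceil(1000/16) = 63 blocks in each dimension:

int width = 1000;                                       // assumed matrix width
dim3 dimBlock(16, 16, 1);                               // 256 threads per block
dim3 dimGrid((width + dimBlock.x - 1) / dimBlock.x,     // ceiling division -> 63 blocks in x
             (width + dimBlock.y - 1) / dimBlock.y,     // 63 blocks in y
             1);

// 63 x 63 x 256 = 1,016,064 threads are issued; the threads that fall outside
// the 1000 x 1000 matrix must be masked off with a bounds check in the kernel.
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);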
Kernel Function Specification
// Matrix multiplication kernel -- per thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Pvalue is used to store the element of the matrix
    // that is computed by the thread
    float Pvalue = 0;

    for (int k = 0; k < width; ++k) {
        float Melement = Md[ty*width + k];
        float Nelement = Nd[k*width + tx];
        Pvalue += Melement * Nelement;
    }

    Pd[ty*width + tx] = Pvalue;
}
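Note that this kernel indexes P with threadIdx only, so it is correct only when the whole matrix fits inside a single block; a hedged sketch of the multi-block variant, folding in blockIdx and the bounds check from the launch example above (our extension, not shown in the slides):

__global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int width)
{
    // Global row and column of the P element handled by this thread
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < width && col < width) {       // mask off threads outside the matrix
        float Pvalue = 0;
        for (int k = 0; k < width; ++k)
            Pvalue += Md[row*width + k] * Nd[k*width + col];
        Pd[row*width + col] = Pvalue;
    }
}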
CUDA Function Declarations

                                   Executed on    Only callable from
__device__ float DeviceFunc()      device         device
__global__ void KernelFunc()       device         host
__host__   float HostFunc()        host           host
- __global__ defines a kernel function
  - Must be a void function
- __device__ and __host__ can be used together
  - Generates two versions of the same function during compilation
- For functions executed on the device
  - __global__ functions do not support recursion
  - __device__ functions only support recursion in device code compiled for devices of compute capability 2.x and higher
  - No static variable declarations inside the function
  - No variable number of arguments
  - No indirect function calls through pointers
- All functions are host functions by default
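A small sketch (our own example) of the combined qualifiers, which makes the compiler generate both a host and a device version of the same helper:

// Compiled twice: once for the host, once for the device
__host__ __device__ float clampf(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void ClampKernel(float* d, int n, float lo, float hi) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = clampf(d[i], lo, hi);    // calls the device version
}
// Ordinary host code can call clampf() as well, using the host version.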