Programming Massively Parallel Processors Using CUDA & C++AMP
Lecture 1 - Introduction
Wen-mei Hwu and Izzat El Hajj
CEA-EDF-Inria Summer School 2013
Course Goals
• Learn how to program massively parallel processors and achieve
– high performance
– functionality and maintainability
– scalability across future generations
• Technical subjects
– principles and patterns of parallel algorithms
– processor architecture features and constraints
– programming API, tools and techniques
Text/Notes
• D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach,” 2nd Edition, Morgan Kaufmann Publishers, 2012
• NVIDIA, NVIDIA CUDA C Programming Guide, version 5.0, NVIDIA, 2013 (reference book)
Course Sessions Overview
• Monday: Lecture 1, Lecture 2, Lab Session 1
• Tuesday: Lecture 3, Lecture 4
• Wednesday: Lecture 5, Lecture 6, Lab Session 2
• Thursday: Lab Session 3
• Friday: Lab Session 4, Lab Session 5
Blue Waters Hardware
• Cray System & Storage cabinets: >300
• Compute nodes: >25,000
• Usable Storage Bandwidth: >1 TB/s
• System Memory: >1.5 Petabytes
• Memory per core module: 4 GB
• Gemini Interconnect Topology: 3D Torus
• Usable Storage: >25 Petabytes
• Peak performance: >11.5 Petaflops
• Number of AMD Interlagos processors: >49,000
• Number of AMD x86 core modules: >380,000
• Number of NVIDIA Kepler GPUs: >4,000
Cray XK7 Compute Node
[Figure: XK7 compute node block diagram — host CPU and GPU connected by HT3 links and PCIe Gen2, with the Gemini interconnect providing the X, Y, Z torus links]

XK7 Compute Node Characteristics:
• Host processor: AMD Series 6200 (Interlagos)
• Host memory: 32 GB, 1600 MT/s DDR3
• GPU: NVIDIA Kepler (NVIDIA Tesla X2090 memory: 6 GB GDDR5 capacity; Keplers in final installation)
• Interconnect: Gemini High Speed Interconnect
CPUs and GPUs have very different design philosophies

[Figure: CPU (latency-oriented cores) vs. GPU (throughput-oriented cores) — the CPU chip devotes its area to a few cores, each with a large local cache, registers, control logic, and a SIMD unit; the GPU chip packs many compute units, each with a small cache/local memory, registers, SIMD units, and heavy threading]
CPUs: Latency Oriented Design

• Large caches
– convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
– branch prediction for reduced branch latency
– data forwarding for reduced data latency
• Powerful ALUs
– reduced operation latency

[Figure: CPU block diagram — a few powerful ALUs, large control logic, a large cache, and DRAM]
GPUs: Throughput Oriented Design

• Small caches
– to boost memory throughput
• Simple control
– no branch prediction
– no data forwarding
• Energy-efficient ALUs
– many, long-latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies

[Figure: GPU block diagram — many small cores and DRAM]
Winning Applications Use Both CPU and GPU

• CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
Heterogeneous parallel computing is catching on.
• 280 submissions to GPU Computing Gems
– 110 articles included in two volumes

Application areas include: Financial Analysis, Scientific Simulation, Engineering Simulation, Data Intensive Analytics, Medical Imaging, Digital Audio Processing, Computer Vision, Digital Video Processing, Biomedical Informatics, Electronic Design Automation, Statistical Modeling, Ray Tracing Rendering, Interactive Physics, Numerical Methods
What is at stake?

• Scalable and portable software lasts through many hardware generations
• Scalable algorithms and libraries can be the best legacy we can leave behind from this era
Introduction to CUDA C
Objective
• To learn about data parallelism and the basic features of CUDA C that enable exploitation of data parallelism
– hierarchical thread organization
– main interfaces for launching parallel execution
– mapping of thread indices to data indices
Parallel Programming Work Flow
• Identify compute-intensive parts of an application
• Adopt scalable algorithms
• Optimize data arrangements to maximize locality
• Performance tuning
• Pay attention to code portability and maintainability
Massive Parallelism - Regularity
CUDA/OpenCL – Execution Model

• Integrated host+device application C program
– serial or modestly parallel parts in host C code
– highly parallel parts in device SPMD kernel C code

[Figure: application timeline alternating between serial host code and parallel device kernels:
Serial Code (host) … KernelA<<< nBlk, nTid >>>(args); … Serial Code (host) … KernelB<<< nBlk, nTid >>>(args);]
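A minimal, self-contained sketch of this model is shown below; the kernel name, data, and launch sizes are illustrative, and the memory API calls (cudaMalloc, cudaMemcpy, cudaFree) are covered later in this lecture:

#include <cstdio>

// Illustrative SPMD kernel: each thread doubles one element.
__global__ void KernelA(int* data) {
    data[threadIdx.x] *= 2;
}

int main() {
    int h[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int* d;
    cudaMalloc((void**)&d, sizeof(h));                   // device memory
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice); // host -> device
    KernelA<<<1, 8>>>(d);                                // 1 block of 8 threads
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost); // device -> host
    cudaFree(d);
    printf("h[3] = %d\n", h[3]);                         // prints 6
    return 0;
}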
The Von-Neumann Model

[Figure: Memory and I/O connected to a Processor, which contains a Control Unit (PC, IR) and a Processing Unit (ALU, Register File)]
Arrays of Parallel Threads
• A CUDA kernel is executed by a grid (array) of threads
– each thread is a virtualized Von-Neumann processor
– all threads in a grid run the same kernel code (SPMD)
– each thread has an index that it uses to compute memory addresses and make control decisions:

tid = compute thread index
work(tid)
Thread Blocks: Scalable Cooperation
• Divide the thread array into multiple blocks
– threads within a block cooperate via shared memory, atomic operations, and barrier synchronization (see the sketch below)
– threads in different blocks cannot cooperate
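As a small illustrative sketch of intra-block cooperation (the function name and the 256-thread block size are assumptions, not from the slides), threads in one block can stage data in shared memory and meet at a barrier before reading each other's elements:

// Illustrative: intra-block cooperation via shared memory and a barrier.
// Assumes a single block of exactly 256 threads, e.g. <<<1, 256>>>.
__global__ void reverseInBlock(int* data) {
    __shared__ int tile[256];            // visible to all threads in the block
    int t = threadIdx.x;
    tile[t] = data[t];                   // each thread stages one element
    __syncthreads();                     // barrier: all stores above are now visible
    data[t] = tile[blockDim.x - 1 - t];  // safely read another thread's element
}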
Computing the Thread Index

• gridDim.x – # of blocks in the grid
• blockIdx.x – position of the block in the grid
• blockDim.x – # of threads in each block
• threadIdx.x – position of the thread in its block

tid = blockIdx.x*blockDim.x + threadIdx.x

For example, with blockDim.x = 256, the thread with blockIdx.x = 2 and threadIdx.x = 5 computes tid = 2*256 + 5 = 517.
blockIdx and threadIdx
• Each thread uses indices to decide what data to work on
– blockIdx: 1D, 2D, or 3D (CUDA 4.0)
– threadIdx: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– image processing
– solving PDEs on volumes
– …
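For instance, here is a hedged sketch of 2D index-to-pixel mapping for image processing (the names scalePixels, img, w, h and the 16x16 block shape are assumptions for illustration):

// Illustrative: map a 2D grid of threads onto image pixels.
__global__ void scalePixels(float* img, int w, int h, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (col < w && row < h)                            // skip threads past the edges
        img[row * w + col] *= s;
}

// Possible launch: 16x16 threads per block, enough blocks to cover the image
// dim3 DimBlock(16, 16, 1);
// dim3 DimGrid((w + 15) / 16, (h + 15) / 16, 1);
// scalePixels<<<DimGrid, DimBlock>>>(img_d, w, h, 2.0f);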
Vector Addition – Conceptual View

[Figure: vectors A (A[0] … A[N-1]) and B (B[0] … B[N-1]) are added element by element to produce vector C (C[0] … C[N-1]): C[i] = A[i] + B[i]]
Vector Addition – Traditional C Code
// Compute vector sum C = A + B
void vecAdd(float* A, float* B, float* C, int n)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

int main()
{
    // Memory allocation for A_h, B_h, and C_h
    // I/O to read A_h and B_h, N elements
    …
    vecAdd(A_h, B_h, C_h, N);
}
Heterogeneous Computing vecAdd Host Code

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;
    ...
    // 1. Allocate device memory for A, B, and C;
    //    copy A and B to device memory
    // 2. Kernel launch code – have the device
    //    perform the actual vector addition
    // 3. Copy C from device memory;
    //    free the device vectors
}

[Figure: CPU with host memory and GPU with device memory; Parts 1 and 3 are the transfers between the two memories, Part 2 runs on the GPU]
Control Flow

[Figure: control alternates between CPU and GPU:
cudaMalloc(...) → cudaMemcpy(...) host-to-device → kernel<<<...>>>(...) on the GPU → cudaMemcpy(...) device-to-host → cudaFree(...)]
Partial Overview of CUDA Memories
• Device code can:
– R/W per-thread registers
– R/W per-grid global memory
• Host code can:
– transfer data to/from per-grid global memory

[Figure: a device grid with Blocks (0,0) and (1,0), each containing Threads (0,0) and (1,0) with per-thread registers, all sharing global memory that the host can access]

We will cover more later.
CUDA Device Memory Management API Functions

• cudaMalloc()
– allocates an object in device global memory
– two parameters:
• address of a pointer to the allocated object
• size of the allocated object in bytes
• cudaFree()
– frees an object from device global memory
– one parameter: pointer to the freed object
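Both functions return a cudaError_t status code; the slides omit error handling for brevity, but a minimal sketch of checking it (the allocation size here is illustrative) looks like this:

#include <cstdio>
#include <cstdlib>

int main() {
    float* A_d;
    size_t size = 1024 * sizeof(float);            // illustrative size
    cudaError_t err = cudaMalloc((void**)&A_d, size);
    if (err != cudaSuccess) {
        // cudaGetErrorString converts the code to a readable message
        fprintf(stderr, "%s in %s at line %d\n",
                cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
    cudaFree(A_d);                                 // release the device allocation
    return 0;
}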
Host-Device Data Transfer API Functions

• cudaMemcpy()
– memory data transfer
– requires four parameters:
• pointer to destination
• pointer to source
• number of bytes copied
• type/direction of transfer
– transfer to device is asynchronous
void vecAdd(float* A, float* B, float* C, int n)
{
    int size = n * sizeof(float);
    float *A_d, *B_d, *C_d;

    // 1. Transfer A and B to device memory
    cudaMalloc((void **) &A_d, size);
    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &B_d, size);
    cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
    // Allocate device memory for C
    cudaMalloc((void **) &C_d, size);

    // 2. Kernel invocation code – to be shown later
    ...

    // 3. Transfer C from device to host
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
    // Free device memory for A, B, and C
    cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
}
Example: Vector Addition Kernel

[Figure series: two input vectors mapped to one output vector, illustrating in turn:
• assign 1 thread per element
• quantization of the global thread count due to block organization
• boundary checks
• data padding]
Example: Vector Addition Kernel

Device Code:

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

Host Code:

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256.0) blocks of 256 threads each
    vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
}
More on Kernel Launch

Host Code:

int vecAdd(float* A, float* B, float* C, int n)
{
    // A_d, B_d, C_d allocations and copies omitted
    // Run ceil(n/256) blocks of 256 threads each
    dim3 DimGrid(n/256, 1, 1);
    if (n%256) DimGrid.x++;   // add one block for the partial remainder
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking (see the sketch below)
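A minimal sketch of such blocking, reusing the names above (cudaDeviceSynchronize is one standard way for the host to wait):

// Illustrative: the launch below returns to the host immediately.
vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
cudaDeviceSynchronize();   // block the host until the kernel finishes
// A blocking cudaMemcpy from the device would also wait for the kernel.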
Kernel Execution in a Nutshell

__global__
void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C_d[i] = A_d[i] + B_d[i];
}

__host__
void vecAdd(float *A_d, float *B_d, float *C_d, int n)
{
    dim3 DimGrid(ceil(n/256.0), 1, 1);
    dim3 DimBlock(256, 1, 1);
    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
}

[Figure: the kernel's blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, which share the device RAM]
More on CUDA Function Declarations
                                 Executed on:   Only callable from:
__device__ float DeviceFunc()    device         device
__global__ void KernelFunc()     device         host
__host__   float HostFunc()      host           host

• __global__ defines a kernel function
– each “__” consists of two underscore characters
– a kernel function must return void
• __device__ and __host__ can be used together (see the sketch below)
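A brief illustrative sketch of that last point (the function names are assumptions): one definition compiles for both the CPU and the GPU:

// Illustrative: __host__ __device__ makes one definition usable on both sides.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void squareAll(float* A_d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A_d[i] = square(A_d[i]);   // device-side call
}
// Host code may also call square(2.0f) directly.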
Compiling A CUDA Program

[Figure: integrated C programs with CUDA extensions enter the NVCC compiler, which splits them into host code, handled by the host C compiler/linker, and device code (PTX), handled by the device just-in-time compiler; both parts run on a heterogeneous computing platform with CPUs and GPUs]
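As a hedged example (the file and output names are illustrative), a single nvcc invocation drives this whole flow:

nvcc -o vecadd vecAdd.cu   (host code goes to the host compiler; device code is compiled toward PTX)
./vecadd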
ANY MORE QUESTIONS?
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2013