4/25/2012
GPU History CUDA Intro
Graphics Pipeline Elements
1. A scene description: vertices, triangles, colors, lighting
2. Transformations that map the scene to a camera viewpoint
3. “Effects”: texturing, shadow mapping, lighting calculations
4. Rasterizing: converting geometry into pixels
5. Pixel processing: depth tests, stencil tests, and other per-pixel operations
A Fixed Function GPU Pipeline

[Figure: The CPU drives the GPU through a host interface. The fixed-function stages are vertex control (with a vertex cache), vertex shading/transform & lighting (VS/T&L), triangle setup, raster, shader (with a texture cache), raster operations (ROP), and the frame buffer interface (FBI) to frame buffer memory.]
Texture Mapping Example

Texture mapping example: painting a world map texture image onto a globe object.
Programmable Vertex and Pixel Processors

[Figure: An example of separate vertex processor and fragment processor in a programmable graphics pipeline. A 3D application or game issues 3D API commands (OpenGL or Direct3D); the GPU front end receives the GPU command and data stream at the CPU–GPU boundary. Pre-transformed vertices (fed by a vertex index stream) pass through a programmable vertex processor; the transformed vertices go through primitive assembly into assembled polygons, lines, and points, then rasterization & interpolation. The rasterized pre-transformed fragments (with a pixel location stream) pass through a programmable fragment processor, and the transformed fragments reach the framebuffer through raster ops as pixel updates.]
What is (Historical) GPGPU ?
• General Purpose computation using GPU and graphics API in
applications other than 3D graphics
– GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Model is SPMD
– Low-latency floating point (FP) computation
• Applications – see http://gpgpu.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix algebra,
convolution, correlation, sorting
Tesla GPU
• NVIDIA developed a more general-purpose GPU
• Can be programmed like a regular processor
• Must explicitly declare the data-parallel parts of the workload
– Shader processors are fully programmable processors with instruction memory, cache, and sequencing logic
– Memory load/store instructions with random byte addressing capability
– Parallel programming model primitives: threads, barrier synchronization, atomic operations
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
– Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
– Standalone Driver - Optimized for computation
– Interface designed for compute – graphics-free API
– Data sharing with OpenGL buffer objects
– Guaranteed maximum download & readback speeds
– Explicit GPU memory management
CUDA Devices and Threads
• A compute device
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
– Is typically a GPU but can also be another type of parallel processing
device
• Data-parallel portions of an application are expressed as device
kernels which run on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
G80 CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing
[Figure: The host feeds an input assembler and a thread execution manager, which dispatch threads across the processor array; each group of processors shares a parallel data cache and a texture unit, and all processors reach global memory through load/store units.]
Arrays of Parallel Threads
• A CUDA kernel is executed by an array of threads
– All threads run the same code (SPMD)
– Each thread has an ID that it uses to compute memory addresses and make control decisions
[Figure: threads 0–7 of Thread Block 0 through Thread Block N − 1; every thread runs the same kernel body on its own threadID:]

float x = input[threadID];
float y = func(x);
output[threadID] = y;
Thread Blocks: Scalable Cooperation
• Divide monolithic thread array into multiple blocks
– Threads within a block cooperate via shared memory,
atomic operations and barrier synchronization
– Threads in different blocks cannot cooperate
– Up to 65535 blocks, 512 threads/block
[Figure 3.2: An example of CUDA thread organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device. Grid 1 contains Blocks (0,0), (1,0), (0,1), and (1,1); Block (1,1) expands into a 4 x 2 x 2 array of threads indexed from (0,0,0) to (3,1,1).]
Block IDs and Thread IDs
• We launch a “grid” of “blocks” of “threads”
• Each thread uses IDs to decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data
– Image processing
– Solving PDEs on volumes
– …
CUDA Memory Model Overview
• Global memory
– Main means of
communicating R/W
Data between host and
device
– Contents visible to all
threads
– Long latency access
[Figure: CUDA memory model. The host reads and writes the device's global memory; within a grid, each block has its own shared memory and each thread has its own registers.]
CUDA Device Memory Allocation
• cudaMalloc()
– Allocates object in the
device Global Memory
– Requires two parameters
• Address of a pointer to the
allocated object
• Size of allocated object
• cudaFree()
– Frees object from device
Global Memory
• Pointer to freed object
DON’T use a CPU pointer in a GPU function!
CUDA Device Memory Allocation (cont.)
• Code example:
– Allocate a 64 * 64 single precision float array
– Attach the allocated storage to Md
– “d” is often used to indicate a device data structure
int TILE_WIDTH = 64;
float* Md;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
CUDA Host-Device Data Transfer
• cudaMemcpy()
– memory data transfer
– Requires four parameters
• Pointer to destination
• Pointer to source
• Number of bytes copied
• Type of transfer
– Host to Host
– Host to Device
– Device to Host
– Device to Device
• cudaMemcpy() itself is blocking; cudaMemcpyAsync() provides non-blocking/asynchronous transfers
CUDA Host-Device Data Transfer
(cont.)
• Code example:
– Transfer a 64 * 64 single precision float array
– M is in host memory and Md is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
CUDA Function Declarations

                                   Executed on:   Only callable from:
__device__ float DeviceFunc()      device         device
__global__ void KernelFunc()       device         host
__host__   float HostFunc()        host           host
• __global__ defines a kernel function
– Must return void
• __device__ and __host__ can be used
together
Code Example

#include <stdio.h>

__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a, b, c;
    int *dev_c;
    a = 3;
    b = 4;
    cudaMalloc((void**)&dev_c, sizeof(int));
    add<<<1,1>>>(a, b, dev_c); // 1 block and 1 thread/block
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d is %d\n", a, b, c);
    cudaFree(dev_c);
    return 0;
}
Sequential Code – Adding Arrays

#include <stdio.h>

#define N 10

void add(int *a, int *b, int *c)
{
    int tID = 0;
    while (tID < N)
    {
        c[tID] = a[tID] + b[tID];
        tID += 1;
    }
}

int main()
{
    int a[N], b[N], c[N];
    // Fill arrays
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 1;
    }
    add(a, b, c);
    for (int i = 0; i < N; i++)
    {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    return 0;
}
CUDA Code – Adding Arrays

#include <stdio.h>

#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tID = blockIdx.x;
    if (tID < N)
    {
        c[tID] = a[tID] + b[tID];
    }
}
int main()
{
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
cudaMalloc((void **) &dev_a, N*sizeof(int));
cudaMalloc((void **) &dev_b, N*sizeof(int));
cudaMalloc((void **) &dev_c, N*sizeof(int));
// Fill Arrays
for (int i = 0; i < N; i++)
{
a[i] = i;
b[i] = 1;
}
cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
add<<<N,1>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; i++)
{
printf("%d + %d = %d\n", a[i], b[i], c[i]);
}
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
Julia Fractal
• Evaluates an iterative equation for points in the complex plane
– A point is not in the set if iterating diverges and approaches infinity
– A point is in the set if iterating remains bounded
• Equation: Z_{n+1} = Z_n^2 + C
– where Z is a point in the complex plane and C is a constant
• Our implementation uses the FreeImage library
CPU Fractal Implementation
• Structure to store, multiply, and add complex numbers
#include "FreeImage.h"
#include "stdio.h"
#define DIM 1000
struct cuComplex {
float r;
float i;
cuComplex( float a, float b ) : r(a), i(b) {}
float magnitude2( void ) { return r * r + i * i; }
cuComplex operator*(const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
cuComplex operator+(const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
CPU Fractal Implementation
• Julia function
int julia(int x, int y)
{
const float scale = 1.5;
float jx = scale * (float)(DIM/2 - x)/(DIM/2);
float jy = scale * (float)(DIM/2 - y)/(DIM/2);
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
int i = 0;
for (i = 0; i < 200; i++)
{
a = a*a + c;
if (a.magnitude2() > 1000) return 0;
}
return 1;
}
CPU Fractal Implementation
• What will become our kernel
– Array of char is 0 or 1 to indicate pixel or no pixel
void kernel(char *ptr)
{
for (int y = 0; y<DIM; y++)
for (int x=0; x<DIM; x++)
{
int offset = x + y * DIM;
ptr[offset] = julia(x,y);
}
}
CPU Fractal Implementation

int main()
{
FreeImage_Initialise();
FIBITMAP * bitmap = FreeImage_Allocate(DIM, DIM, 32);
char charmap[DIM][DIM];
kernel(&charmap[0][0]);
RGBQUAD color;
for (int i = 0; i < DIM; i++){
for (int j = 0; j < DIM; j++){
color.rgbRed = 0;
color.rgbGreen = 0;
color.rgbBlue = 0;
if (charmap[i][j]!=0)
color.rgbBlue = 255;
FreeImage_SetPixelColor(bitmap, i, j, &color);
}
}
FreeImage_Save(FIF_BMP, bitmap, "output.bmp");
FreeImage_Unload(bitmap);
FreeImage_DeInitialise();
return 0;
}
GPU Fractal Implementation
• Assign the computation of each point to a processor
• Use a 2D block and the blockIdx.x and blockIdx.y
variables to determine which pixel we should be
working on
GPU Fractal
• __device__ on the member functions makes them callable from code running on the compute device (the qualifier goes on each function, not on the struct itself)

struct cuComplex {
float r;
float i;
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
__device__ float magnitude2( void ) { return r * r + i * i; }
__device__ cuComplex operator*(const cuComplex& a) {
return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
}
__device__ cuComplex operator+(const cuComplex& a) {
return cuComplex(r+a.r, i+a.i);
}
};
GPU Fractal
__device__ int julia(int x, int y)
{
// Same as CPU version
}
__global__ void kernel(char *ptr)
{
int x = blockIdx.x;
int y = blockIdx.y;
int offset = x + y * DIM;
ptr[offset] = julia(x,y);
}
GPU Fractal
int main()
{
FreeImage_Initialise();
FIBITMAP * bitmap = FreeImage_Allocate(DIM, DIM, 32);
char charmap[DIM][DIM];
char *dev_charmap;
cudaMalloc((void**)&dev_charmap, DIM*DIM*sizeof(char));
dim3 grid(DIM,DIM);
kernel<<<grid,1>>>(dev_charmap);
cudaMemcpy(charmap, dev_charmap, DIM*DIM*sizeof(char),
cudaMemcpyDeviceToHost);
RGBQUAD color;
for (int i = 0; i < DIM; i++){
for (int j = 0; j < DIM; j++){
color.rgbRed = 0;
color.rgbGreen = 0;
color.rgbBlue = 0;
if (charmap[i][j]!=0)
color.rgbBlue = 255;
FreeImage_SetPixelColor(bitmap, i, j, &color);
}
}
FreeImage_Save(FIF_BMP, bitmap, "output.bmp");
FreeImage_Unload(bitmap);
FreeImage_DeInitialise();
cudaFree(dev_charmap);
return 0;
}