Graphics Processing Units

REFERENCES:
• Computer Architecture, 5th Edition, Hennessy and Patterson, 2012
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
• http://www.realworldtech.com/page.cfm?ArticleID=RWT093009110932&p=1
• http://www.moderngpu.com/intro/performance.html
• http://heather.cs.ucdavis.edu/parprocbook
CPU vs. GPU
• CPU: small fraction of chip used for arithmetic
  (image source: http://chip-architect.com/news/2003_04_20_Looking_at_Intels_Prescott_part2.html)
• GPU: large fraction of chip used for arithmetic
  (image source: http://www.pcper.com/reviews/Graphics-Cards/NVIDIA-GT200-Revealed-GeForce-GTX-280-and-GTX-260-Review/NVIDIA-GT200-Archite)
CPU vs. GPU: peak arithmetic throughput
• Intel Haswell: 170 GFLOPS (quad-core at 3.4 GHz)
• AMD Radeon R9 290: 4800 GFLOPS (at 0.95 GHz)
• Nvidia GTX 970: 5000 GFLOPS (at 1.05 GHz)
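As a rough sanity check of such peak numbers (not from the slides): peak FLOPS ≈ cores × clock × FLOPs per core per cycle, where a fused multiply-add counts as 2 FLOPs. For the R9 290, assuming AMD's published count of 2560 stream processors:

    2560 cores × 0.95 GHz × 2 FLOPs/cycle ≈ 4864 GFLOPS ≈ the 4800 GFLOPS quoted above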
GPGPU
• General-purpose GPU programming: massively parallel
• Scientific computing, brain simulations, etc.
• In supercomputers: 53 of the top500.org supercomputers used NVIDIA/AMD GPUs (Nov 2014 ranking), including the 2nd- and 6th-place machines
OpenCL vs. CUDA
• Both target GPGPU
• OpenCL: open standard, supported on AMD, NVIDIA, Intel, Altera, …
• CUDA: proprietary (Nvidia); losing ground to OpenCL?
• Similar performance
CUDA
References:
• Programming on Parallel Machines, Norm Matloff, Chapter 5
• http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf
CUDA uses a thread hierarchy:
• Thread
• Block
• Grid
Thread
• Executes an instance of a kernel (program)
• Has a ThreadID (within its block), program counter, registers, private memory, input and output parameters
• Private memory is used for register spills, function calls, array variables
(Nvidia Fermi Whitepaper pg 6)
Block
• Set of concurrently executing threads
• Threads cooperate via barrier synchronization and shared memory (fast but small), as sketched below
• Has a BlockID (within its grid)
(Nvidia Fermi Whitepaper pg 6)
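A minimal CUDA sketch of block-level cooperation (the kernel name and the 512-element limit are assumptions, not from the slides): each thread loads one element into shared memory, the block synchronizes at a barrier, then each thread reads an element written by another thread.

    // hypothetical kernel: reverse up to 512 elements within one block
    __global__ void reverse_block(float *d, int n) {
        __shared__ float s[512];         // fast, per-block shared memory
        int t = threadIdx.x;
        if (t < n) s[t] = d[t];          // each thread loads one element
        __syncthreads();                 // barrier: wait for all loads
        if (t < n) d[t] = s[n - 1 - t];  // read another thread's data
    }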
Grid
• Array of thread blocks running the same kernel
• Blocks read and write global memory (slow: hundreds of cycles)
• Dependent kernel calls synchronize between grids (see the sketch below)
(Nvidia Fermi Whitepaper pg 6)
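A sketch of grid-level ordering (kernel names hypothetical): kernels launched into the same (default) stream run in order, so the second grid sees the first grid's global-memory writes, and the host can block until both finish.

    stepA<<<nblocks,512>>>(d_data);   // writes global memory
    stepB<<<nblocks,512>>>(d_data);   // same stream: runs after stepA
    cudaDeviceSynchronize();          // host waits for both grids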
Hardware Mapping
• GPU executes 1+ kernel (program) grids
• Streaming Multiprocessor (SM) executes 1+ thread blocks
• CUDA core executes thread
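These hardware counts can be queried at run time; a sketch using the standard CUDA runtime call cudaGetDeviceProperties (compile with nvcc):

    #include <cstdio>
    int main() {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);   // properties of device 0
        printf("%s: %d SMs, warp size %d\n",
               p.name, p.multiProcessorCount, p.warpSize);
        return 0;
    }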
Fermi Architecture
• Debuted in 2010
• 512 CUDA cores, each executing 1 FP or integer instruction per cycle
• 32 CUDA cores per SM, 16 SMs per GPU
• 6 64-bit memory ports
• PCI-Express interface to CPU
• GigaThread scheduler distributes blocks to SMs; each SM has a hardware thread scheduler with fast context switch
• 3 billion transistors
(Nvidia Fermi Whitepaper pg 7)
CUDA Core
• Pipelined integer and FP units
• FP: IEEE 754-2008 fused multiply-add (see the sketch below)
• Integer unit: boolean, shift, move, compare, ...
(Nvidia Fermi Whitepaper pg 8)
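A short sketch of the fused multiply-add: fmaf (C99, usable in device code) computes a*b+c with a single rounding step, which is what IEEE 754-2008 FMA specifies; __fmaf_rn is the CUDA round-to-nearest intrinsic. The kernel name is hypothetical.

    __global__ void fma_demo(float a, float b, float c, float *out) {
        out[0] = a * b + c;          // may round twice if not fused
        out[1] = fmaf(a, b, c);      // fused: one rounding for a*b+c
        out[2] = __fmaf_rn(a, b, c); // CUDA intrinsic, round-to-nearest
    }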
Streaming Multiprocessor (SM)
• 32 CUDA cores
• 16 ld/st units calculate source/destination addresses
• Special Function Units: sin, cosine, reciprocal, sqrt (see the sketch below)
(Nvidia Fermi Whitepaper pg 8)
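The SFU paths are exposed in CUDA through fast-math intrinsics; a sketch (kernel name hypothetical; these hardware approximations trade some accuracy against sinf/sqrtf for throughput):

    __global__ void sfu_demo(const float *x, float *y, int n) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = __sinf(x[i]) * rsqrtf(x[i]); // fast approximations
    }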
Warps
• 32 threads from a block are bundled into a warp, which executes the same instruction each cycle
• The warp is therefore the minimum size of the SIMD data
• Warps are implicitly synchronized; if threads branch in different directions, the warp steps through both paths using predicated instructions (see the sketch below)
• Two warp schedulers each select 1 instruction from a warp to issue to 16 cores, 16 ld/st units, or 4 SFUs
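A sketch of divergence within a warp (kernel name hypothetical): the whole warp executes both sides of the branch, with the inactive threads predicated off, so the warp pays for both paths.

    __global__ void divergent(float *y) {
        int t = threadIdx.x;
        if (t % 2 == 0)
            y[t] = y[t] * 2.0f;  // odd lanes predicated off here
        else
            y[t] = y[t] + 1.0f;  // even lanes predicated off here
    }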
Maxwell Architecture
• 2014
• 16 streaming multiprocessors × 128 cores/SM = 2048 cores
Programming CUDA
C code:

    daxpy(n, 2.0, x, y);   // invoke

    // serial version: one loop over all n elements
    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }
Programming CUDA
CUDA code:

    // host side: launch a grid of 512-thread blocks
    __host__
    int nblocks = (n + 511) / 512;        // grid size
    daxpy<<<nblocks,512>>>(n, 2.0, x, y); // 512 threads/block

    // device side: each thread handles one element
    __global__
    void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
n = 8192, 512 threads/block:

    grid
      block 0
        warp 0:  Y[0]    = A*X[0]    + Y[0]    ...  Y[31]   = A*X[31]   + Y[31]
        ...
        warp 15: Y[480]  = A*X[480]  + Y[480]  ...  Y[511]  = A*X[511]  + Y[511]
      ...
      block 15
        warp 0:  Y[7680] = A*X[7680] + Y[7680] ...  Y[7711] = A*X[7711] + Y[7711]
        ...
        warp 15: Y[8160] = A*X[8160] + Y[8160] ...  Y[8191] = A*X[8191] + Y[8191]
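Where a given element index i lands can be computed directly; a sketch using the 512 threads/block and 32 threads/warp from above:

    int block = i / 512;         // e.g. i = 7700 -> block 15
    int warp  = (i % 512) / 32;  //                 warp 0 of that block
    int lane  = i % 32;          //                 lane 20 of that warp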
Moving data between host and GPU

    int main() {
        double *x, *y, a, *dx, *dy;
        x = (double *)malloc(sizeof(double)*n);
        y = (double *)malloc(sizeof(double)*n);
        // initialize x and y
        …
        cudaMalloc((void **)&dx, n*sizeof(double));   // allocate device memory
        cudaMalloc((void **)&dy, n*sizeof(double));
        cudaMemcpy(dx, x, n*sizeof(double), cudaMemcpyHostToDevice);
        …
        daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);       // kernel gets device pointers
        cudaThreadSynchronize();   // wait for kernel (cudaDeviceSynchronize in newer CUDA)
        cudaMemcpy(y, dy, n*sizeof(double), cudaMemcpyDeviceToHost);
        cudaFree(dx); cudaFree(dy);                   // free device memory
        free(x); free(y);
    }
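The slide code omits error handling; in practice each runtime call returns a cudaError_t worth checking, and kernel-launch errors are picked up with cudaGetLastError. A sketch (not from the slides):

    cudaError_t err = cudaMalloc((void **)&dx, n*sizeof(double));
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
    daxpy<<<nblocks,512>>>(n, 2.0, dx, dy);
    err = cudaGetLastError();               // catches launch failures
    if (err != cudaSuccess)
        fprintf(stderr, "launch: %s\n", cudaGetErrorString(err));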