DATA-LEVEL PARALLELISM IN VECTOR, SIMD AND GPU ARCHITECTURES (PART 2)
CPE731 - Dr. Iyad Jafar
Chapter 4
Appendix A (Computer Organization and Design Book)
OUTLINE
• SIMD Instruction Set Extensions for Multimedia (4.3)
• Graphical Processing Units (4.4)
• Detecting and Enhancing Loop-Level Parallelism (4.5)
SIMD EXTENSIONS
• SIMD multimedia extensions started from the observation that media applications operate on data types narrower than 32 bits
  • Pixels (8 bits) and audio samples (8 or 16 bits)
• Partition wide hardware to handle several smaller operands at little additional cost
• Similar to vector ISAs, SIMD instructions operate on vectors of data
• However, SIMD instructions specify far fewer operands
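As a rough illustration of what such partitioning targets, consider a scalar loop over 8-bit pixels (the function and array names here are illustrative, not from the text); a single 64-bit MMX-style operation can perform eight of these 8-bit additions at once:

// Scalar pixel addition; a partitioned 64-bit ALU does eight 8-bit adds per operation
void add_pixels(unsigned char *dst, const unsigned char *a,
                const unsigned char *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = (unsigned char)(a[i] + b[i]);  // wraps on overflow; real media code often saturates
}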
SIMD EXTENSIONS
• Compared to vector ISAs, SIMD extensions
  • Fix the number of data operands in the opcode, while vector ISAs have a vector length register (VLR)
    • Increases the number of instructions in SIMD
  • Do not support strided access or gather-scatter addressing modes
    • Lower possibility of vectorization
  • Do not offer mask registers
    • Harder for the compiler to generate SIMD code and more difficult to program in SIMD assembly language
SIMD EXTENSIONS
• Implementations
  • Intel Multimedia Extensions (MMX) (1996)
    • Repurposed the 64-bit floating-point registers
    • Eight 8-bit integer ops or four 16-bit integer ops simultaneously
  • Streaming SIMD Extensions (SSE) (1999)
    • Added separate 128-bit registers
    • Allow sixteen 8-bit, eight 16-bit, or four 32-bit operations simultaneously
    • SSE2, SSE3, and SSE4 added further multimedia instructions
  • Advanced Vector Extensions (AVX) (2010)
    • Doubled the width of the registers to 256 bits
• These extensions are intended to accelerate carefully written libraries rather than requiring the compiler to generate the code
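As a quick arithmetic check of the widths listed above (standard element sizes, not figures from the slides):

// MMX:  64-bit registers = 8 x 8-bit or 4 x 16-bit integer elements
// SSE: 128-bit registers = 16 x 8-bit, 8 x 16-bit, or 4 x 32-bit elements
// AVX: 256-bit registers = 8 x 32-bit or 4 x 64-bit floating-point elements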
SIMD EXTENSIONS
• Why are multimedia SIMD extensions so popular?
  • Little cost and easy to add to the standard arithmetic unit
  • Little extra state
  • Need less memory bandwidth
  • Fewer virtual memory problems, since accesses involve few, aligned operands
  • Vector architectures had issues with caches!
SIMD EXTENSIONS
• MIPS SIMD
  • 256-bit SIMD instructions
  • The suffix '4D' indicates FP SIMD instructions that operate on four double-precision operands at once
  • Have four lanes
  • Reuse the floating-point registers as operands for 4D instructions
• Example
  • Show the MIPS SIMD code for DAXPY!
SIMD EXTENSIONS
L.D F0,a ;load scalar a
MOV F1, F0 ;copy a into F1 for SIMD MUL
MOV F2, F0 ;copy a into F2 for SIMD MUL
MOV F3, F0 ;copy a into F3 for SIMD MUL
CP
E731 -
Dr. Iy
ad
Jafa
r
MOV F3, F0 ;copy a into F3 for SIMD MUL
DADDIU R4,Rx,#512 ;last address to load
Loop: L.4D F4,0(Rx) ;load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4,F4,F0 ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
L.4D F8,0(Ry) ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8,F8,F4 ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D 0(Ry),F8 ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx,Rx,#32 ;increment index to X
� Not as beneficial as VMIPS! But better than MIPS!9
DADDIU Rx,Rx,#32 ;increment index to X
DADDIU Ry,Ry,#32 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done
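For reference, a minimal C sketch of the scalar loop that this SIMD code implements (DAXPY: Y = a*X + Y; names follow the usual convention):

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  // each L.4D/MUL.4D/ADD.4D/S.4D group above covers four iterations
}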
SIMD EXTENSIONS
• Roofline Performance Model [Williams, 2009]
  • Compares floating-point performance of variations of SIMD architectures
  • Ties floating-point performance, memory performance, and arithmetic intensity in one graph
  • Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed
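The model itself is a simple minimum; a small sketch of the standard roofline relation (parameter names are illustrative):

// Attainable GFLOP/s = min(peak FP performance, peak memory bandwidth x arithmetic intensity)
double roofline_gflops(double peak_gflops, double peak_bw_gb_per_s, double flops_per_byte) {
    double memory_bound = peak_bw_gb_per_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}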
GPUS – INTRODUCTION
• By the end of the last century, graphics on a PC were performed by a video graphics array (VGA) controller, i.e. a memory controller and display generator
• VGAs evolved to include more advanced graphics functions (shading, texture mapping, ...)
• By 2000, the term GPU was coined to reflect that the graphics device had become a processor
  • Programmable processors replaced fixed-function graphics logic
  • More precision: integer to double-precision floating point
• GPUs have become massively programmable parallel processors (100s of cores and 1000s of threads)
• GPUs implement all forms of parallelism: multithreading, MIMD, SIMD and ILP
GPUS – INTRODUCTION
• Given the hardware invested to do graphics well, how can it be supplemented to improve the performance of a wider range of applications?
• GPU Computing
  • Using the GPU for computing via a parallel programming language and API, without using the traditional graphics API and graphics pipeline
• Basic idea
  • Heterogeneous execution model: the CPU is the host, the GPU is the device
  • Develop a C-like programming language for the GPU
    • Compute Unified Device Architecture (CUDA)
    • OpenCL as a vendor-independent language
  • Unify all forms of GPU parallelism as the CUDA thread
  • The programming model is "Single Instruction Multiple Thread" (SIMT)
GPUS - CUDA
• Compute Unified Device Architecture (CUDA) is a scalable parallel programming model for the GPU and parallel processors
• CUDA addresses the challenge of heterogeneous systems and various forms of parallelism
• CUDA provides C/C++ for the system processor (host) and a C/C++ dialect for the GPU (device)
GPUS - THREADS AND BLOCKS
• A GPU is simply a multiprocessor system composed of multithreaded SIMD processors
• A thread (CUDA thread) is associated with each data element/iteration
• Threads are organized into thread blocks
  • Up to 512 elements or threads per block
  • Each block executes on a multithreaded SIMD processor
  • 32 elements executed per thread of SIMD instructions at a time
• Blocks are organized into a grid
  • Blocks are executed independently and in any order
  • Different blocks cannot communicate directly but can coordinate using atomic memory operations in Global Memory
• Thread management is done by the GPU hardware, not by applications or the OS
GPUS - THREADS AND BLOCKS
[Figure: a kernel launch of n threads, organized as thread blocks of 256 threads each]
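A minimal CUDA sketch of such a launch, in the spirit of the textbook's DAXPY example (256 threads per block as in the figure; the guard handles a final partial block):

__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index of this CUDA thread
    if (i < n)                                      // mask off threads beyond the array length
        y[i] = a * x[i] + y[i];
}

// Host side: launch n threads as ceil(n/256) blocks of 256 threads each
// daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);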
GPUS – NVIDIA ARCHITECTURE
• Example: multiplying two 8192-element vectors
  • The code (a for loop in this case) that works on all 8192 elements is the grid (vectorized loop)
  • The grid is decomposed into thread blocks (body of the vectorized loop)
    • Each has up to 512 elements
    • Need 8192/512 = 16 blocks
  • Assuming that SIMD instructions process 32 elements at a time
    • Each thread block has 512/32 = 16 threads of SIMD instructions (warps)
  • The maximum number of SIMD threads that can execute simultaneously is 16 for Tesla and 32 for Fermi
  • A thread block is assigned to a multithreaded SIMD processor by the thread block scheduler
  • Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
GPUS – NVIDIA ARCHITECTURE
[Figure: simplified block diagram of a multithreaded SIMD processor. It has 16 SIMD lanes. The SIMD thread scheduler has, say, 48 independent threads of SIMD instructions that it schedules with a table of 48 PCs.]
GPUS – NVIDIA ARCHITECTURE
• The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions (warp)
• Each SIMD thread
  • Contains exclusively SIMD instructions
  • Has its own PC
  • Runs on a multithreaded SIMD processor
  • Is independent of other SIMD threads
• SIMD threads in a processor are scheduled by the SIMD thread scheduler
  • It has a scoreboard to know which threads of SIMD instructions are ready to run
  • Schedules threads of SIMD instructions
• Hence, there are two levels of scheduling: thread blocks onto SIMD processors, and SIMD threads onto SIMD lanes
GPUS – NVIDIA ARCHITECTURE
• The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD lanes executing the SIMD thread.
• Because threads of SIMD instructions are independent, the scheduler may select a different SIMD thread each time.
GPUS – NVIDIA ARCHITECTURE
• An NVIDIA GPU has 32,768 32-bit registers
  • Divided across the SIMD lanes
  • Each SIMD thread is limited to 64 registers
    • 64 vector registers of 32 32-bit elements, or
    • 32 vector registers of 32 64-bit elements
  • Fermi has 16 physical SIMD lanes, each containing 2048 registers
• Registers are dynamically allocated when SIMD threads are created and freed when they exit
• Note that a CUDA thread is just a vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD lane.
GPUS – NVIDIA ARCHITECTURE
• Terminology Summary
  • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
  • Warp: a group of threads executed physically in parallel in G80/GT200
  • Block: a group of threads that are executed together and form the unit of resource assignment
  • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
• Mapping Summary
  • A grid is broken into thread blocks. Blocks are independent and can execute in any order.
  • A thread block consists of CUDA threads, each 32 of which form a warp (SIMD thread)
  • Threads in a block execute the same program and are assumed to be independent
  • Blocks are identified by blockIdx
  • Threads are identified by threadIdx (sequential within a block)
GPUS – NVIDIA ARCHITECTURE
Host
Kernel
Device
Grid 1
Block Block Thread Id #:
SIMD Thread or Warp
CP
E731 -
Dr. Iy
ad
Jafa
r
Kernel 1
Kernel 2
Block(0, 0)
Block(1, 0)
Block(0, 1)
Block(1, 1)
Grid 2
Block (1, 1)(0,0,1) (1,0,1) (2,0,1) (3,0,1)
Thread Id #:
0 1 2 3 … m
Thread program
23Courtesy: NDVIA
Thread(0,1,0)
Thread(1,1,0)
Thread(2,1,0)
Thread(3,1,0)
Thread(0,0,0)
Thread(1,0,0)
Thread(2,0,0)
Thread(3,0,0)
(0,0,1) (1,0,1) (2,0,1) (3,0,1)
Courtesy: John Nickolls,
NVIDIA
NVIDIA GPUS PERFORMANCE
[Figure slide]
NVIDIA GPUS
• NVIDIA GTX280 Specifications
  • 933 GFLOPS peak performance
  • 10 thread processing clusters (TPC)
  • 3 multiprocessors per TPC
  • 8 cores per multiprocessor
  • 16384 registers per multiprocessor
  • 16 KB shared memory per multiprocessor
  • 64 KB constant cache per multiprocessor
  • 6 KB < texture cache < 8 KB per multiprocessor
  • 1.3 GHz clock rate
  • Single and double-precision floating-point calculation
  • 1 GB DDR3 dedicated memory
THE FERMI GPU ARCHITECTURE
• Each SIMD processor has
  • Two SIMD thread schedulers and two instruction dispatch units
  • 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  • Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision: from 78 GFLOPs in the prior generation to 515 GFLOPs for DAXPY
• Caches for GPU memory: instruction/data L1 caches per SIMD processor and a shared L2 cache
• 64-bit addressing and a unified address space: C/C++ pointers
• Error correcting codes: dependability for long-running applications
• Faster context switching: hardware support, 10X faster
• Faster atomic instructions: 5-20X faster than the prior generation
GPUS – NVIDIA ISA
• The instruction set target of NVIDIA compilers is an abstraction of the hardware instruction set
  • Parallel Thread Execution (PTX) provides a stable ISA for compilers; the hardware ISA is hidden
  • PTX uses virtual registers; the compiler assigns the required physical registers
• General format of a PTX instruction
      opcode.type d, a, b, c
  • a, b and c are source operands, while d is the destination
  • Operands are 32-bit or 64-bit registers or constant values
  • The destination d is a register or memory (for store instructions)
• Check p. 299 for PTX instructions!
GPUS – NVIDIA ISA
• PTX code for one CUDA thread in DAXPY
      shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512)
      add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
      shl.u32       R8, R8, 3          ; byte offset (8 bytes per double)
      ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
      ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
      mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
      add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
      st.global.f64 [Y+R8], RD0        ; Y[i] = a*X[i] + Y[i]
GPUS – NVIDIA ISA
• Conditional Branching
  • GPU hardware executes an instruction for all threads in the same warp before moving to the next instruction (SIMT)
  • Works well when all threads in a warp follow the same control-flow path
  • It is not uncommon to have conditional branching within a loop; CUDA threads may take different paths: branch divergence!
  • Solution: serialize the execution paths
    • Example: if-then-else is executed in two passes
      • One pass for threads executing the THEN path
      • Second pass for threads executing the ELSE path
      • Merge threads in the warp once completed
GPUS – NVIDIA ISA
• Illustration
[Figure: a warp of CUDA threads reaches a branch; threads taking Path A execute in Pass 1 (the THEN part) while the rest are masked off; threads taking Path B execute in Pass 2 (the ELSE part); the warp then merges.]
GPUS – NVIDIA ISA
• Implementation
  • Hardware
    • Internal masks (just like vector processors)
    • Predicate registers (1 bit per SIMD lane)
    • Branch synchronization stack per SIMD lane (for nested IFs)
    • Instruction markers to control masks (*comp, *push, *pop)
    • Lanes are enabled or disabled based on the 1-bit predicate register values in each pass
GPUS – NVIDIA ISA
• Example
      if (X[i] != 0)
          X[i] = X[i] - Y[i];
      else X[i] = Z[i];

      ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
      setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
      @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
      ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
      sub.f64        RD0, RD0, RD2  ; Difference in RD0
      st.global.f64  [X+R8], RD0    ; X[i] = RD0
      @P1, bra ENDIF1, *Comp        ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
      st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop    ; pop to restore old mask
GPUS – NVIDIA ISA
• It is as if each element has its own program counter, giving the illusion that each CUDA thread acts independently
• Vector compilers could do the same tricks
  • But they need scalar instructions to manipulate the mask registers
  • GPUs do it at run time!
• What if all threads take the same path?
  • Optimization!
    • When all mask bits are 0, the THEN part is skipped
    • Similarly, when the mask bits are all 1, the ELSE part is skipped
  • Vector processors cannot do this at compile time, since mask values are only known at run time!
GPUS – NVIDIA ISA
• Conditional Branching Performance
  • How frequently does divergence occur?
  • In the best case, all masks are the same: only the THEN or the ELSE part is executed
  • If at least one CUDA thread diverges, we need two passes
    • 50% efficiency when the THEN and ELSE parts are of equal length
  • With nested IF-THEN-ELSE, the cost is higher
    • Doubly nested: 25%
    • Triply nested: 12.5%
  • Active research area for optimization
  • Optimization? Avoid divergence when possible, as in the sketch below
    • if (threadIdx.x > 2) ??
    • if (threadIdx.x / WARP_SIZE > 2) ??
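A minimal CUDA sketch contrasting these two conditions (WARP_SIZE is assumed to be 32; the kernel and buffer names are illustrative):

#define WARP_SIZE 32

__global__ void divergence_demo(float *out) {
    // Divergent: threads 0-2 and threads 3-31 of the same warp take different
    // paths, so the warp serializes both the THEN and the ELSE pass.
    if (threadIdx.x > 2)
        out[threadIdx.x] = 1.0f;
    else
        out[threadIdx.x] = 2.0f;

    // Non-divergent: every thread of a given warp computes the same value of
    // threadIdx.x / WARP_SIZE, so each warp follows a single path.
    if (threadIdx.x / WARP_SIZE > 2)
        out[threadIdx.x] += 1.0f;
    else
        out[threadIdx.x] += 2.0f;
}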
GPUS – NVIDIA MEMORY STRUCTURE
Private memory
• Off-chip; recently cached in the L1 and L2 caches
• For the stack and spilling registers
Local memory
• On-chip
• One per multithreaded SIMD processor
• Shared between threads in a block
• Dynamically allocated to blocks
Global (GPU) memory
• Off-chip
• Shared by all SIMD processors
• Accessed by the host
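A short CUDA sketch of how variables map onto this hierarchy (kernel and variable names are illustrative; in CUDA source the on-chip per-block space is declared with __shared__):

__global__ void memory_spaces(float *g_data) {     // g_data resides in global (GPU) memory
    float temp;                                     // private, per-thread (spills go off-chip)
    __shared__ float tile[256];                     // local memory: on-chip, one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_data[i];                  // stage this block's elements in local memory
    __syncthreads();                                // threads of the block coordinate here

    temp = 2.0f * tile[threadIdx.x];
    g_data[i] = temp;                               // write the result back to global memory
}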
GPUS VS. VECTOR PROCESSORS
• Both architectures
  • Designed to execute DLP programs
  • Both have multiple processors
• However, architecturally
  • GPUs rely on multithreading (shallow pipelines)
  • GPUs have more registers
  • GPUs have many lanes (8-16 vs. 2-8)
• Memory
  • Vector processors have explicit unit-stride loads
  • In GPUs it is implicit (address coalescing)
• Branches
  • Vector processors manage masks explicitly in software; GPUs do it at run time
  • Strip-mining in vector processors requires the VLR; GPUs iterate the loop until the last iteration and mask off unused lanes
GPUS VS. VECTOR PROCESSORS
• Control Unit
  • In vector processors, it handles vector and scalar operations
  • GPUs have no such control unit; the thread block scheduler is the closest analog (less power efficient)
• Scalar Processor
  • Vector processors have a separate, simple scalar processor
  • GPUs have none; they use a single SIMD lane and disable the others, rather than using the system processor (less power efficient and slower)
GPUS VS. MULTIMEDIA SIMD PROCESSORS
[Figure slide]
READING ASSIGNMENT
• Section 4.7 - Putting It All Together
• Appendix A from the Computer Organization and Design textbook