DATA-LEVEL PARALLELISM IN VECTOR, SIMD AND GPU ARCHITECTURES (PART 2)
CPE731 - Dr. Iyad Jafar
Chapter 4
Appendix A (Computer Organization and Design Book)
OUTLINE
• SIMD Instruction Set Extensions for Multimedia (4.3)
• Graphical Processing Units (4.4)
• Detecting and Enhancing Loop-Level Parallelism (4.5)
SIMD EXTENSIONS
• SIMD multimedia extensions started from the observation that media applications operate on data types narrower than 32 bits
  • Pixels (8 bits) and audio samples (8 or 16 bits)
• Partition wide hardware to handle several smaller operands at little additional cost
• Similar to vector ISAs, SIMD instructions operate on vectors of data
• However, SIMD instructions specify far fewer operands
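As a rough illustration of what such partitioning targets, consider a scalar loop over 8-bit pixels (the function and array names here are illustrative, not from the text); a single 64-bit MMX-style operation can perform eight of these 8-bit additions at once:

// Scalar pixel addition; a partitioned 64-bit ALU does eight 8-bit adds per operation
void add_pixels(unsigned char *dst, const unsigned char *a,
                const unsigned char *b, int n) {
    for (int i = 0; i < n; i++)
        dst[i] = (unsigned char)(a[i] + b[i]);  // wraps on overflow; real media code often saturates
}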
SIMD EXTENSIONS
• Compared to vector ISAs, SIMD extensions
  • Fix the number of data operands in the opcode, while vector ISAs have a vector length register (VLR)
    • Increases the number of instructions in SIMD
  • Do not support strided access or gather-scatter addressing modes
    • Lower possibility of vectorization
  • Do not offer mask registers
    • Harder for the compiler to generate SIMD code and more difficult to program in SIMD assembly language
SIMD EXTENSIONS
• Implementations
  • Intel Multimedia Extensions (MMX) (1996)
    • Repurposed the 64-bit floating-point registers
    • Eight 8-bit integer ops or four 16-bit integer ops simultaneously
  • Streaming SIMD Extensions (SSE) (1999)
    • Added separate 128-bit registers
    • Allow sixteen 8-bit, eight 16-bit, or four 32-bit operations simultaneously
    • SSE2, SSE3, and SSE4 added further multimedia instructions
  • Advanced Vector Extensions (AVX) (2010)
    • Doubled the width of the registers to 256 bits
• These extensions are intended to accelerate carefully written libraries rather than requiring the compiler to generate the code
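As a quick arithmetic check of the widths listed above (standard element sizes, not figures from the slides):

// MMX:  64-bit registers = 8 x 8-bit or 4 x 16-bit integer elements
// SSE: 128-bit registers = 16 x 8-bit, 8 x 16-bit, or 4 x 32-bit elements
// AVX: 256-bit registers = 8 x 32-bit or 4 x 64-bit floating-point elements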
SIMD EXTENSIONS
• Why are multimedia SIMD extensions so popular?
  • Little cost and easy to add to the standard arithmetic unit
  • Little extra state
  • Need less memory bandwidth
  • Fewer virtual memory problems, since accesses involve few, aligned operands
  • Vector architectures had issues with caches!
SIMD EXTENSIONS
• MIPS SIMD
  • 256-bit SIMD instructions
  • The suffix '4D' indicates FP SIMD instructions that operate on four double-precision operands at once
  • Have four lanes
  • Reuse the floating-point registers as operands for 4D instructions
• Example
  • Show the MIPS SIMD code for DAXPY!
SIMD EXTENSIONS
L.D F0,a ;load scalar a
MOV F1, F0 ;copy a into F1 for SIMD MUL
MOV F2, F0 ;copy a into F2 for SIMD MUL
MOV F3, F0 ;copy a into F3 for SIMD MUL
CP
E731 -
Dr. Iy
ad
Jafa
r
MOV F3, F0 ;copy a into F3 for SIMD MUL
DADDIU R4,Rx,#512 ;last address to load
Loop: L.4D F4,0(Rx) ;load X[i], X[i+1], X[i+2], X[i+3]
MUL.4D F4,F4,F0 ;a×X[i],a×X[i+1],a×X[i+2],a×X[i+3]
L.4D F8,0(Ry) ;load Y[i], Y[i+1], Y[i+2], Y[i+3]
ADD.4D F8,F8,F4 ;a×X[i]+Y[i], ..., a×X[i+3]+Y[i+3]
S.4D 0(Ry),F8 ;store into Y[i], Y[i+1], Y[i+2], Y[i+3]
DADDIU Rx,Rx,#32 ;increment index to X
� Not as beneficial as VMIPS! But better than MIPS!9
DADDIU Rx,Rx,#32 ;increment index to X
DADDIU Ry,Ry,#32 ;increment index to Y
DSUBU R20,R4,Rx ;compute bound
BNEZ R20,Loop ;check if done
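For reference, a minimal C sketch of the scalar loop that this SIMD code implements (DAXPY: Y = a*X + Y; names follow the usual convention):

void daxpy(int n, double a, double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  // each L.4D/MUL.4D/ADD.4D/S.4D group above covers four iterations
}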
SIMD EXTENSIONS
• Roofline Performance Model [Williams, 2009]
  • Compares floating-point performance of variations of SIMD architectures
  • Ties floating-point performance, memory performance, and arithmetic intensity in one graph
  • Arithmetic intensity is the ratio of floating-point operations per byte of memory accessed
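The model itself is a simple minimum; a small sketch of the standard roofline relation (parameter names are illustrative):

// Attainable GFLOP/s = min(peak FP performance, peak memory bandwidth x arithmetic intensity)
double roofline_gflops(double peak_gflops, double peak_bw_gb_per_s, double flops_per_byte) {
    double memory_bound = peak_bw_gb_per_s * flops_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}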
GPUS – INTRODUCTION
• By the end of the last century, graphics on a PC were performed by a video graphics array (VGA) controller, i.e. a memory controller and display generator
• VGAs evolved to include more advanced graphics functions (shading, texture mapping, ...)
• By 2000, the term GPU was coined to reflect that the graphics device had become a processor
  • Programmable processors replaced fixed-function graphics logic
  • More precision: integer to double-precision floating point
• GPUs have become massively programmable parallel processors (100s of cores and 1000s of threads)
• GPUs implement all forms of parallelism: multithreading, MIMD, SIMD and ILP
GPUS – INTRODUCTION
• Given the hardware invested to do graphics well, how can it be supplemented to improve the performance of a wider range of applications?
• GPU Computing
  • Using the GPU for computing via a parallel programming language and API, without using the traditional graphics API and graphics pipeline
• Basic idea
  • Heterogeneous execution model: the CPU is the host, the GPU is the device
  • Develop a C-like programming language for the GPU
    • Compute Unified Device Architecture (CUDA)
    • OpenCL as a vendor-independent language
  • Unify all forms of GPU parallelism as the CUDA thread
  • The programming model is "Single Instruction Multiple Thread" (SIMT)
GPUS - CUDA
• Compute Unified Device Architecture (CUDA) is a scalable parallel programming model for the GPU and parallel processors
• CUDA addresses the challenge of heterogeneous systems and various forms of parallelism
• CUDA provides C/C++ for the system processor (host) and a C/C++ dialect for the GPU (device)
GPUS - THREADS AND BLOCKS
• A GPU is simply a multiprocessor system composed of multithreaded SIMD processors
• A thread (CUDA thread) is associated with each data element/iteration
• Threads are organized into thread blocks
  • Up to 512 elements or threads per block
  • Each block executes on a multithreaded SIMD processor
  • 32 elements executed per thread of SIMD instructions at a time
• Blocks are organized into a grid
  • Blocks are executed independently and in any order
  • Different blocks cannot communicate directly but can coordinate using atomic memory operations in Global Memory
• Thread management is done by the GPU hardware, not by applications or the OS
GPUS - THREADS AND BLOCKS
[Figure: a kernel launch of n threads, organized as thread blocks of 256 threads each]
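A minimal CUDA sketch of such a launch, in the spirit of the textbook's DAXPY example (256 threads per block as in the figure; the guard handles a final partial block):

__global__ void daxpy(int n, double a, double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index of this CUDA thread
    if (i < n)                                      // mask off threads beyond the array length
        y[i] = a * x[i] + y[i];
}

// Host side: launch n threads as ceil(n/256) blocks of 256 threads each
// daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);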
GPUS – NVIDIA ARCHITECTURE
• Example: multiplying two 8192-element vectors
  • The code (a for loop in this case) that works on all 8192 elements is the grid (vectorized loop)
  • The grid is decomposed into thread blocks (body of the vectorized loop)
    • Each has up to 512 elements
    • Need 8192/512 = 16 blocks
  • Assuming that SIMD instructions process 32 elements at a time
    • Each thread block has 512/32 = 16 threads of SIMD instructions (warps)
  • The maximum number of SIMD threads that can execute simultaneously is 16 for Tesla and 32 for Fermi
  • A thread block is assigned to a multithreaded SIMD processor by the thread block scheduler
  • Current-generation GPUs (Fermi) have 7-15 multithreaded SIMD processors
GPUS – NVIDIA ARCHITECTURE
[Figure: simplified block diagram of a multithreaded SIMD processor. It has 16 SIMD lanes. The SIMD thread scheduler has, say, 48 independent threads of SIMD instructions that it schedules with a table of 48 PCs.]
GPUS – NVIDIA ARCHITECTURE
• The machine object that the hardware creates, manages, schedules, and executes is a thread of SIMD instructions (warp)
• Each SIMD thread
  • Contains exclusively SIMD instructions
  • Has its own PC
  • Runs on a multithreaded SIMD processor
  • Is independent of other SIMD threads
• SIMD threads in a processor are scheduled by the SIMD thread scheduler
  • It has a scoreboard to know which threads of SIMD instructions are ready to run
  • Schedules threads of SIMD instructions
• Hence, there are two levels of scheduling: thread blocks onto SIMD processors, and SIMD threads onto SIMD lanes
GPUS – NVIDIA ARCHITECTURE
• The scheduler selects a ready thread of SIMD instructions and issues an instruction synchronously to all the SIMD lanes executing the SIMD thread.
• Because threads of SIMD instructions are independent, the scheduler may select a different SIMD thread each time.
GPUS – NVIDIA ARCHITECTURE
• An NVIDIA GPU has 32,768 32-bit registers
  • Divided across the SIMD lanes
  • Each SIMD thread is limited to 64 registers
    • 64 vector registers of 32 32-bit elements, or
    • 32 vector registers of 32 64-bit elements
  • Fermi has 16 physical SIMD lanes, each containing 2048 registers
• Registers are dynamically allocated when SIMD threads are created and freed when they exit
• Note that a CUDA thread is just a vertical cut of a thread of SIMD instructions, corresponding to one element executed by one SIMD lane.
GPUS – NVIDIA ARCHITECTURE
• Terminology Summary
  • Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
  • Warp: a group of threads executed physically in parallel in G80/GT200
  • Block: a group of threads that are executed together and form the unit of resource assignment
  • Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
• Mapping Summary
  • A grid is broken into thread blocks. Blocks are independent and can execute in any order.
  • A thread block consists of CUDA threads, each 32 of which form a warp (SIMD thread)
  • Threads in a block execute the same program and are assumed to be independent
  • Blocks are identified by blockIdx
  • Threads are identified by threadIdx (sequential within a block)
GPUS – NVIDIA ARCHITECTURE
Host
Kernel
Device
Grid 1
Block Block Thread Id #:
SIMD Thread or Warp
CP
E731 -
Dr. Iy
ad
Jafa
r
Kernel 1
Kernel 2
Block(0, 0)
Block(1, 0)
Block(0, 1)
Block(1, 1)
Grid 2
Block (1, 1)(0,0,1) (1,0,1) (2,0,1) (3,0,1)
Thread Id #:
0 1 2 3 … m
Thread program
23Courtesy: NDVIA
Thread(0,1,0)
Thread(1,1,0)
Thread(2,1,0)
Thread(3,1,0)
Thread(0,0,0)
Thread(1,0,0)
Thread(2,0,0)
Thread(3,0,0)
(0,0,1) (1,0,1) (2,0,1) (3,0,1)
Courtesy: John Nickolls,
NVIDIA
NVIDIA GPUS PERFORMANCE
[Figure slide]
NVIDIA GPUS
• NVIDIA GTX280 Specifications
  • 933 GFLOPS peak performance
  • 10 thread processing clusters (TPC)
  • 3 multiprocessors per TPC
  • 8 cores per multiprocessor
  • 16384 registers per multiprocessor
  • 16 KB shared memory per multiprocessor
  • 64 KB constant cache per multiprocessor
  • 6 KB < texture cache < 8 KB per multiprocessor
  • 1.3 GHz clock rate
  • Single and double-precision floating-point calculation
  • 1 GB DDR3 dedicated memory
THE FERMI GPU ARCHITECTURE
• Each SIMD processor has
  • Two SIMD thread schedulers and two instruction dispatch units
  • 16 SIMD lanes (SIMD width = 32, chime = 2 cycles), 16 load-store units, 4 special function units
  • Thus, two threads of SIMD instructions are scheduled every two clock cycles
• Fast double precision: from 78 GFLOPs in the prior generation to 515 GFLOPs for DAXPY
• Caches for GPU memory: instruction/data L1 caches per SIMD processor and a shared L2 cache
• 64-bit addressing and a unified address space: C/C++ pointers
• Error correcting codes: dependability for long-running applications
• Faster context switching: hardware support, 10X faster
• Faster atomic instructions: 5-20X faster than the prior generation
GPUS – NVIDIA ISA
• The instruction set target of NVIDIA compilers is an abstraction of the hardware instruction set
  • Parallel Thread Execution (PTX) provides a stable ISA for compilers; the hardware ISA is hidden
  • PTX uses virtual registers; the compiler assigns the required physical registers
• General format of a PTX instruction
      opcode.type d, a, b, c
  • a, b and c are source operands, while d is the destination
  • Operands are 32-bit or 64-bit registers or constant values
  • The destination d is a register or memory (for store instructions)
• Check p. 299 for PTX instructions!
GPUS – NVIDIA ISA
• PTX code for one CUDA thread in DAXPY
      shl.s32       R8, blockIdx, 9    ; Thread Block ID * Block size (512)
      add.s32       R8, R8, threadIdx  ; R8 = i = my CUDA thread ID
      shl.u32       R8, R8, 3          ; byte offset (8 bytes per double)
      ld.global.f64 RD0, [X+R8]        ; RD0 = X[i]
      ld.global.f64 RD2, [Y+R8]        ; RD2 = Y[i]
      mul.f64       RD0, RD0, RD4      ; Product in RD0 = RD0 * RD4 (scalar a)
      add.f64       RD0, RD0, RD2      ; Sum in RD0 = RD0 + RD2 (Y[i])
      st.global.f64 [Y+R8], RD0        ; Y[i] = a*X[i] + Y[i]
GPUS – NVIDIA ISA
• Conditional Branching
  • GPU hardware executes an instruction for all threads in the same warp before moving to the next instruction (SIMT)
  • Works well when all threads in a warp follow the same control-flow path
  • It is not uncommon to have conditional branching within a loop; CUDA threads may take different paths: branch divergence!
  • Solution: serialize the execution paths
    • Example: if-then-else is executed in two passes
      • One pass for threads executing the THEN path
      • Second pass for threads executing the ELSE path
      • Merge threads in the warp once completed
GPUS – NVIDIA ISA
• Illustration
[Figure: a warp of CUDA threads reaches a branch; threads taking Path A execute in Pass 1 (the THEN part) while the rest are masked off; threads taking Path B execute in Pass 2 (the ELSE part); the warp then merges.]
GPUS – NVIDIA ISA
• Implementation
  • Hardware
    • Internal masks (just like vector processors)
    • Predicate registers (1 bit per SIMD lane)
    • Branch synchronization stack per SIMD lane (for nested IFs)
    • Instruction markers to control masks (*comp, *push, *pop)
    • Lanes are enabled or disabled based on the 1-bit predicate register values in each pass
GPUS – NVIDIA ISA
• Example
      if (X[i] != 0)
          X[i] = X[i] - Y[i];
      else X[i] = Z[i];

      ld.global.f64  RD0, [X+R8]    ; RD0 = X[i]
      setp.neq.s32   P1, RD0, #0    ; P1 is predicate register 1
      @!P1, bra ELSE1, *Push        ; Push old mask, set new mask bits
                                    ; if P1 false, go to ELSE1
      ld.global.f64  RD2, [Y+R8]    ; RD2 = Y[i]
      sub.f64        RD0, RD0, RD2  ; Difference in RD0
      st.global.f64  [X+R8], RD0    ; X[i] = RD0
      @P1, bra ENDIF1, *Comp        ; complement mask bits
                                    ; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8]    ; RD0 = Z[i]
      st.global.f64  [X+R8], RD0    ; X[i] = RD0
ENDIF1: <next instruction>, *Pop    ; pop to restore old mask
GPUS – NVIDIA ISA
• It is as if each element has its own program counter, giving the illusion that each CUDA thread acts independently
• Vector compilers could do the same tricks
  • But they need scalar instructions to manipulate the mask registers
  • GPUs do it at run time!
• What if all threads take the same path?
  • Optimization!
    • When all mask bits are 0, the THEN part is skipped
    • Similarly, when the mask bits are all 1, the ELSE part is skipped
  • Vector processors cannot do this at compile time, since mask values are only known at run time!
GPUS – NVIDIA ISA
• Conditional Branching Performance
  • How frequently does divergence occur?
  • In the best case, all masks are the same: only the THEN or the ELSE part is executed
  • If at least one CUDA thread diverges, we need two passes
    • 50% efficiency when the THEN and ELSE parts are of equal length
  • With nested IF-THEN-ELSE, the cost is higher
    • Doubly nested: 25%
    • Triply nested: 12.5%
  • Active research area for optimization
  • Optimization? Avoid divergence when possible, as in the sketch below
    • if (threadIdx.x > 2) ??
    • if (threadIdx.x / WARP_SIZE > 2) ??
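A minimal CUDA sketch contrasting these two conditions (WARP_SIZE is assumed to be 32; the kernel and buffer names are illustrative):

#define WARP_SIZE 32

__global__ void divergence_demo(float *out) {
    // Divergent: threads 0-2 and threads 3-31 of the same warp take different
    // paths, so the warp serializes both the THEN and the ELSE pass.
    if (threadIdx.x > 2)
        out[threadIdx.x] = 1.0f;
    else
        out[threadIdx.x] = 2.0f;

    // Non-divergent: every thread of a given warp computes the same value of
    // threadIdx.x / WARP_SIZE, so each warp follows a single path.
    if (threadIdx.x / WARP_SIZE > 2)
        out[threadIdx.x] += 1.0f;
    else
        out[threadIdx.x] += 2.0f;
}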
GPUS – NVIDIA MEMORY STRUCTURE
Private memory
• Off-chip; recently cached in the L1 and L2 caches
• For the stack and spilling registers
Local memory
• On-chip
• One per multithreaded SIMD processor
• Shared between threads in a block
• Dynamically allocated to blocks
Global (GPU) memory
• Off-chip
• Shared by all SIMD processors
• Accessed by the host
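A short CUDA sketch of how variables map onto this hierarchy (kernel and variable names are illustrative; in CUDA source the on-chip per-block space is declared with __shared__):

__global__ void memory_spaces(float *g_data) {     // g_data resides in global (GPU) memory
    float temp;                                     // private, per-thread (spills go off-chip)
    __shared__ float tile[256];                     // local memory: on-chip, one copy per block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_data[i];                  // stage this block's elements in local memory
    __syncthreads();                                // threads of the block coordinate here

    temp = 2.0f * tile[threadIdx.x];
    g_data[i] = temp;                               // write the result back to global memory
}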
GPUS VS. VECTOR PROCESSORS
• Both architectures
  • Designed to execute DLP programs
  • Both have multiple processors
• However, architecturally
  • GPUs rely on multithreading (shallow pipelines)
  • GPUs have more registers
  • GPUs have many lanes (8-16 vs. 2-8)
• Memory
  • Vector processors have explicit unit-stride loads
  • In GPUs it is implicit (address coalescing)
• Branches
  • Vector processors manage masks explicitly in software; GPUs do it at run time
  • Strip-mining in vector processors requires the VLR; GPUs iterate the loop until the last iteration and mask off unused lanes
GPUS VS. VECTOR PROCESSORS
• Control Unit
  • In vector processors, it handles vector and scalar operations
  • GPUs have no such control unit; the thread block scheduler is the closest analog (less power efficient)
• Scalar Processor
  • Vector processors have a separate, simple scalar processor
  • GPUs have none; they use a single SIMD lane and disable the others, rather than using the system processor (less power efficient and slower)
GPUS VS. MULTIMEDIA SIMD PROCESSORS
[Figure slide]
READING ASSIGNMENT
• Section 4.7 - Putting It All Together
• Appendix A from the Computer Organization and Design textbook