Operation of the SM Pipeline
Sudhakar Yalamanchili unless otherwise noted

Objectives
Cycle-level examination of the operation of the major pipeline stages in a streaming multiprocessor
Understand the type of information necessary for each stage of operation
Identification of performance bottlenecks
Detailed implementations are addressed in subsequent modules
Reading
Documentation for the GPGPU-Sim simulator: a good source of information about the general organization and operation of a streaming multiprocessor
Operation of a scoreboard: https://en.wikipedia.org/wiki/Scoreboarding
P. Xiang, Y. Yang, H. Zhou, "Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation," International Symposium on High Performance Computer Architecture, 2014.
D. Tarjan and K. Skadron, "On-Demand Register Allocation and Deallocation for a Multithreaded Processor," US Patent 2011/ A1, June 2011.

NVIDIA GK110 (Kepler)
[Image: Thread Block Scheduler dispatching thread blocks to the SMXs]
SMX Organization: GK110
[Image from GK110: NVIDIA white paper]
Multiple warp schedulers
64K 32-bit registers
192 cores: 6 clusters of 32 cores each
What are the main stages of a generic SMX pipeline?

A Generic SM Pipeline
[Figure: warps (e.g., Warp 1, Warp 2, Warp 6) flow through I-Fetch -> Decode -> I-Buffer -> Issue -> RF/PRF -> scalar pipelines -> D-Cache -> Writeback; on an I-Cache miss, or unless all data accesses hit ("All Hit?"), the warp rejoins the pending warps]
Front-end: Scalar Fetch & Decode; Instruction Issue & Warp Scheduler
Back-end: Predicate & GP Register Files; Scalar Cores (Scalar Pipelines); Data Memory Access; Writeback/Commit

Single Warp Execution
[Figure: per-warp state (PC, AM, WID, State) for each warp of a thread block; thread blocks make up a grid]
PTX (assembly):

        setp.lt.s32 %p, %r5, %rd4;  // r5 = index, rd4 = N
    @p  bra L1;
        bra L2;
    L1: ld.global.f32 %f1, [%r6];   // r6 = &a[index]
        ld.global.f32 %f2, [%r7];   // r7 = &b[index]
        add.f32 %f3, %f1, %f2;
        st.global.f32 [%r8], %f3;   // r8 = &c[index]
    L2: ret;
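For reference, a CUDA kernel that would compile to PTX of roughly this shape (a reconstruction from the register comments above, not code from the slides; a, b, c, and N are the arrays and bound those comments imply):

    // One thread of a guarded vector add: the bounds test maps to the
    // setp/@p bra pair, the body to the ld/ld/add/st sequence above.
    __global__ void vecAdd(const float* a, const float* b, float* c, int N)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        if (index < N)                 // setp.lt.s32 %p, %r5, %rd4
            c[index] = a[index] + b[index];
    }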
Instruction Fetch & Decode
[Figure: per-warp PCs (Warp 0 PC, Warp 1 PC, ..., Warp n-1 PC) feed a next-warp selector that sends a fetch address to the I-Cache; decoded instructions fill an I-Buffer entry (PC, AM, WID, State, Instr)]
Examples from the Harmonica2 GPU
May realize multiple fetch policies
From GPGPU-Sim Documentation
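As a concrete illustration of one such fetch policy, a minimal round-robin selector over the per-warp PCs (an illustrative sketch; the struct and function names are invented, not GPGPU-Sim code):

    #include <cstdint>
    #include <vector>

    struct WarpFetchState { uint64_t pc; bool ibufferFree; };

    // Pick the next warp to fetch for, starting after the last one served.
    // A warp is eligible only if it has a free I-Buffer slot.
    int nextWarpRoundRobin(const std::vector<WarpFetchState>& warps, int lastWid)
    {
        int n = static_cast<int>(warps.size());
        for (int k = 1; k <= n; ++k) {
            int wid = (lastWid + k) % n;
            if (warps[wid].ibufferFree) return wid;  // send warps[wid].pc to the I-Cache
        }
        return -1;  // no warp can fetch this cycle
    }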
Instruction Buffer
Buffer a fixed number of instructions per warp; example: buffer 2 instructions/warp
[Figure: I-Buffer holding decoded instructions (Instr 1 W1, Instr 2 W1, Instr 1 W2, ..., Instr 2 Wn), each with a V and an R bit, alongside the scoreboard (cf. ECE 6100/CS 6290)]
Coordinated with instruction fetch: need an empty I-Buffer entry for the warp
V: valid instruction in the buffer
R: instruction ready to be issued; set using the scoreboard logic
From GPGPU-Sim Documentation
Instruction Buffer (2)
The scoreboard enforces WAW and RAW hazards
Indexed by warp ID
Each entry hosts the required registers: destination registers are reserved at issue and released at writeback
Enables multiple instructions to be issued from a single warp
From GPGPU-Sim Documentation
Instruction Buffer (3)
Generic scoreboard:

    Name  Busy  Op    Fi  Fj  Fk  Qj  Qk  Rj  Rk
    Int   Yes   Load  F2  R3  -   -   -   No  -

Fi: destination register; Fj, Fk: source registers; Qj, Qk: functional unit producing each source value; Rj, Rk: do the source registers have their values?
From GPGPU-Sim Documentation
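A minimal sketch of the per-warp bookkeeping described above, assuming one reservation mask per warp (the type and member names are illustrative, not GPGPU-Sim's):

    // Destination registers are reserved at issue and released at writeback;
    // an instruction is ready only if no source (RAW) and no destination
    // (WAW) register is still reserved.
    #include <bitset>
    #include <vector>

    constexpr int kRegsPerWarp = 64;  // assumed per-warp register count

    struct Scoreboard {
        std::vector<std::bitset<kRegsPerWarp>> reserved;  // indexed by warp ID

        explicit Scoreboard(int numWarps) : reserved(numWarps) {}

        bool ready(int wid, const std::vector<int>& srcs, int dest) const {
            for (int r : srcs)
                if (reserved[wid][r]) return false;  // RAW hazard
            return !reserved[wid][dest];             // WAW hazard
        }

        void issue(int wid, int dest)     { reserved[wid][dest] = true; }
        void writeback(int wid, int dest) { reserved[wid][dest] = false; }
    };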
Instruction Issue
[Figure: the warp scheduler picks an instruction from a pool of ready warps (e.g., Warp 3, Warp 7, Warp 8)]
The warp scheduler manages the implementation of barriers, register dependencies, and control divergence
From GPGPU-Sim Documentation
Instruction Issue (2)
Barriers: warps wait here for barrier synchronization
All threads in the CTA must reach the barrier
From GPGPU-Sim Documentation
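In CUDA source, this CTA-wide barrier is __syncthreads(); a hedged example (mine, not from the slides; assumes the grid exactly covers the data and a launch with blockDim.x * sizeof(float) bytes of dynamic shared memory):

    // Stage data through shared memory, then reverse it within the block.
    // Every thread of the CTA must reach __syncthreads() before any thread
    // may continue: this is the wait point described above.
    __global__ void reverseInBlock(const float* in, float* out)
    {
        extern __shared__ float buf[];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        buf[threadIdx.x] = in[i];
        __syncthreads();               // warps wait here for the barrier
        out[i] = buf[blockDim.x - 1 - threadIdx.x];
    }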
Instruction Issue (3)
Register dependencies: tracked through the scoreboard
From GPGPU-Sim Documentation
Instruction Issue (4)
Control divergence: handled with a per-warp stack
[Figure: divergent warps held at the warp scheduler; the SIMT stack (per warp) keeps track of divergent threads at a branch]
From GPGPU-Sim Documentation
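A minimal sketch of such a per-warp SIMT stack, assuming 32-thread warps and reconvergence at the branch's immediate post-dominator (the structure follows the usual GPGPU-Sim-style description; the names are illustrative):

    #include <cstdint>
    #include <vector>

    struct SimtEntry {
        uint32_t activeMask;  // lanes enabled on this path
        uint64_t pc;          // next PC for this path
        uint64_t reconvPc;    // where the split paths rejoin
    };

    struct SimtStack {
        std::vector<SimtEntry> s;  // s.back() is the path being executed

        // On a divergent branch, convert the top entry into a reconvergence
        // entry and push the not-taken and taken paths above it.
        void diverge(uint32_t takenMask, uint64_t takenPc,
                     uint64_t fallthroughPc, uint64_t reconvPc) {
            uint32_t active = s.back().activeMask;
            s.back().pc = reconvPc;
            if (active & ~takenMask)
                s.push_back({active & ~takenMask, fallthroughPc, reconvPc});
            if (active & takenMask)
                s.push_back({active & takenMask, takenPc, reconvPc});
        }

        // Pop a path once it reaches its reconvergence PC.
        void maybeReconverge(uint64_t pc) {
            if (s.size() > 1 && pc == s.back().reconvPc) s.pop_back();
        }
    };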
Instruction Issue (5)
The scheduler can issue multiple instructions from a warp
Issue conditions (folded into one check in the sketch below): the warp has valid instructions, is not waiting at a barrier, passes the scoreboard check, and the pipeline is not stalled at the operand access stage (we will get to it later)
Reserve destination registers
Instructions may issue to the memory, SP, or SFU pipelines
Warp scheduling disciplines: more later in the course
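The issue conditions above as a single predicate (a sketch that reuses the Scoreboard type from the earlier sketch; Warp and Instr are invented stand-ins):

    #include <vector>

    struct Instr { std::vector<int> srcs; int dest; };
    struct Warp  { int id; bool hasValidInstr; bool atBarrier; };

    bool canIssue(const Warp& w, const Instr& in, const Scoreboard& sb,
                  bool operandStageStalled)
    {
        return w.hasValidInstr                   // valid instruction in the I-Buffer
            && !w.atBarrier                      // not waiting at a barrier
            && sb.ready(w.id, in.srcs, in.dest)  // scoreboard check (RAW/WAW)
            && !operandStageStalled;             // operand access stage can accept it
    }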
Register File Access
[Figure: single-ported register file banks (Banks 0-15, each holding RF0 ... RF n-1) behind an arbiter; a 1024-bit crossbar delivers operands to operand collectors (OC), which feed dispatch units (DU) for the ALUs, L/S, and SFU]
Scalar Pipeline
Functional units are pipelined
Designs with multiple issue
[Figure: a single core dispatches into ALU, FPU, and LD/SD pipelines that feed a result queue]
Shared Memory Access
Multiple bank organization: data is interleaved across the banks
[Figure: conflict-free access vs. 2-way conflict access]
Bank conflicts extend access times
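A hedged CUDA illustration of the two access patterns in the figure (assumes the usual organization of 32 banks of 4-byte words; the kernel is mine, not from the slides, launched as bankDemo<<<1, 32>>>(d_out)):

    // Thread t reading tile[t] touches 32 distinct banks: conflict free.
    // Thread t reading tile[2*t] maps pairs of threads to the same bank:
    // a 2-way conflict, so the access is serialized over two bank cycles.
    __global__ void bankDemo(float* out)
    {
        __shared__ float tile[64];
        int t = threadIdx.x;          // one warp: t = 0..31
        tile[t] = t;
        tile[t + 32] = t;
        __syncthreads();

        float fast = tile[t];         // stride 1: conflict free
        float slow = tile[2 * t];     // stride 2: 2-way bank conflict
        out[t] = fast + slow;
    }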
Memory Request Coalescing
[Figure: memory requests (Tid, RQ Size, Base Add, Offset) enter a Pending Request Table; memory address coalescing produces a pending RQ count, address masks, and thread masks]
The PRT is filled whenever a memory request is issued
Generate a set of address masks, one for each memory transaction
Issue transactions
From J. Leng et al., "GPUWattch: Enabling Energy Optimizations in GPGPUs," ISCA 2013
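The access patterns that coalescing rewards are visible at the CUDA level; a hedged pair of kernels (illustrative, not from the slides):

    // Consecutive threads touch consecutive words, so one warp's loads
    // coalesce into a few wide transactions.
    __global__ void copyCoalesced(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Strided threads scatter across lines; in the worst case the hardware
    // must generate a separate transaction (address mask) per thread.
    __global__ void copyStrided(const float* in, float* out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }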
Case Study: Kepler GK110

Kepler SMX
[Image from GK110: NVIDIA white paper]
Up to two instructions can be issued per warp

A Slice of the SMX
[Image from GK110: NVIDIA white paper]
Up to two instructions can be issued per warp, e.g., an LD and an SFU operation
More flexible instruction pairing rules
More efficient support for atomic operations in global memory, in both latency and throughput, e.g., atomicAdd, atomicExch

Shuffle Instruction
Permits threads in a warp to share data
[Image from GK110: NVIDIA white paper]
Avoids a load-store sequence
Reduces the shared memory requirement per TB, which increases occupancy
Data is exchanged in registers without using shared memory
Some operations become more efficient
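A hedged example of register-level data sharing with shuffle: a warp-wide sum with no shared memory and no load-store sequence (this uses the modern __shfl_down_sync intrinsic; Kepler-era code used __shfl_down):

    // Tree reduction: each step pulls a value from the lane 'offset'
    // positions higher, entirely in registers. Assumes all 32 lanes active.
    __device__ float warpSum(float v)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        return v;  // lane 0 ends up with the sum over the warp
    }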
Memory Hierarchy
Configurable cache/shared memory configuration for the L1
Read-only cache for compiler or developer (intrinsics) use
Shared L2 across all SMXs
ECC coverage across the hierarchy, with a performance impact
[Figure: L1 cache / shared memory and the read-only cache above the shared L2 cache and DRAM]
From GK110: NVIDIA white paper
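Both features are exposed in CUDA; a hedged example (the kernel is mine, while cudaFuncSetCacheConfig and __ldg are the actual runtime call and intrinsic):

    // __ldg (or a const __restrict__ pointer) steers the load through the
    // read-only data cache on GK110-class parts.
    __global__ void scale(const float* __restrict__ in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * __ldg(&in[i]);
    }

    void configure()
    {
        // Request the larger-L1 split of the configurable L1/shared storage.
        cudaFuncSetCacheConfig(scale, cudaFuncCachePreferL1);
    }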
Dynamic Parallelism
[Image from GK110: NVIDIA white paper]
The ability for device-side nested kernel launch
Eliminates host-GPU interactions, whose current overheads are high
Matches a wider range of parallelism patterns (we will cover this in more depth later)
Examples of recursive, data-dependent parallelism, e.g., AMR
Can we get by with a weaker CPU?
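A hedged sketch of a device-side nested launch (requires a GK110-class device, compute capability 3.5+, compiled with -rdc=true; the kernels are illustrative):

    __global__ void child(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // The parent launches the child directly from the GPU: no host round trip.
    __global__ void parent(float* data, int n)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0)
            child<<<(n + 255) / 256, 256>>>(data, n);
    }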
Concurrent Kernel Launch
[Image from GK110: NVIDIA white paper]
Kernels from multiple streams are now mapped to distinct hardware queues
TBs from multiple kernels can share an SMX
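A hedged, self-contained example of concurrent kernel launch through streams (the stream-per-hardware-queue mapping is GK110's Hyper-Q feature; the kernels are trivial placeholders):

    #include <cuda_runtime.h>

    __global__ void kernelA() { /* independent work */ }
    __global__ void kernelB() { /* independent work */ }

    int main()
    {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        kernelA<<<1, 64, 0, s1>>>();  // stream s1: its own hardware queue
        kernelB<<<1, 64, 0, s2>>>();  // stream s2: may overlap with kernelA
        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }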
Warp and Instruction Dispatch
[Image from GK110: NVIDIA white paper]

Grid Management
Multiple grids launched from both the CPU and the GPU can be handled in Kepler
Need the ability to re-prioritize and schedule new grids

Summary
Synchronous progress of a warp through the SM pipelines
Warp progress in a thread block can diverge for many reasons:
Barriers
Control divergence
Memory divergence
How is the execution optimized? Next.