Introduction to Control Divergence · 2020. 7. 31.


    (1)

    Introduction to Control Divergence

    Lectures Slides and Figures contributed from sources as noted

    (2)

    Objectives

    • Understand the occurrence of control divergence and the concept of thread reconvergence

    v Also described as branch divergence and thread divergence

    • Cover a basic thread reconvergence mechanism – stack-based reconvergence

    • Cover a non-stack based reconvergence mechanism – convergence barriers

    • Dynamic warp formation – inter-warp compaction

    v Increase lane utilization by merging warps

    • Dynamic warp formation – intra-warp compaction

    v Increase lane utilization by compacting threads in a warp

    (3)

    Reading

    • W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009.

    • T. Aamodt, W. Fung, and T. Rogers, “General Purpose Graphics Processor Architectures,” Draft Text, Sections 3.1.1, 3.1.2.

    • G. Diamos, R. Johnson, V. Grover, O. Giroux, J. Choquette, M. Fetterman, A. Tirumala, P. Nelson, R. Krashinsky, “Execution of Divergent Threads Using a Convergence Barrier,” U.S. Patent US 2016/0019066 A1, January 2016.

    • A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, M. Azimi, “SIMD Divergence Optimization Through Intra-Warp Compaction,” ISCA 2013.

    (4)

    Handling Branches

    • CUDA code: if (…) … (true for some threads)

    else … (true for others)

    • What if threads take different branches?

    • Branch Divergence!

    [Figure: four threads in a warp — some take the branch, others do not]
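The divergent if/else above can be mimicked in software (a Python sketch, not actual GPU code): the warp executes both sides of the branch serially, and a per-lane active mask selects which lanes commit on each pass. The values and the two branch bodies are made up for illustration.

```python
# Sketch: serialized execution of a divergent branch in a 4-thread warp.
# Each lane evaluates the condition; the warp then runs the taken path
# with its mask, then the not-taken path with the complementary mask.
def run_divergent_branch(values):
    taken_mask = [v > 0 for v in values]          # per-lane branch outcome
    not_taken_mask = [not t for t in taken_mask]
    result = [None] * len(values)
    # Pass 1: the "if" side; only lanes in taken_mask are active.
    for lane, active in enumerate(taken_mask):
        if active:
            result[lane] = values[lane] * 2       # body of if (illustrative)
    # Pass 2: the "else" side; the complementary lanes are active.
    for lane, active in enumerate(not_taken_mask):
        if active:
            result[lane] = -values[lane]          # body of else (illustrative)
    # Reconvergence: all lanes are active again.
    return result, taken_mask

res, mask = run_divergent_branch([3, -1, 4, -2])
print(res)    # [6, 1, 8, 2]
print(mask)   # [True, False, True, False]
```

Note that both passes run even when only one lane diverges, which is exactly the low-utilization problem the following slides address.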


    (5)

    Branch Divergence

    • Occurs within a warp

    • Branches lead to serialization of branch-dependent code

    v Performance issue: low warp utilization

    [Figure: if (…) { … } else { … } — idle threads on each path until reconvergence]

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: a warp of four threads sharing a common PC; control-flow graph with basic blocks A–G]

    • Different threads follow different control flow paths through the kernel code

    • Thread execution is (partially) serialized

    v Subsets of threads that follow the same path execute in parallel

    (7)

    Basic Idea

    • Split: partition a warp

    v Two mutually exclusive thread subsets, each branching to a different target

    v Identify subsets with two activity masks → effectively two warps

    • Join: merge two subsets of a previously split warp

    v Reconverge the mutually exclusive sets of threads

    • Orchestrate the correct execution for nested branches

    • Note the long history of techniques in SIMD processors (see background in Fung et al.)

    [Figure: thread warp with a common PC and activity mask 0011 identifying the active subset]

    (8)

    Thread Reconvergence

    • Fundamental problem:

    v Merge threads with the same PC

    v How do we sequence execution of threads? This can affect the ability to reconverge

    • Question: When can threads productively reconverge?

    • Question: When is the best time to reconverge?

    [CFG: blocks A–G — threads must be at the same PC to reconverge]


    (9)

    Dominator

    • Node d dominates node n if every path from the entry node to n must go through d


    (10)

    Immediate Dominator

    • Node d immediately dominates node n if every path from the entry node to n must go through d, and no other node that dominates n lies between d and n


    (11)

    Post Dominator

    • Node d post dominates node n if every path from the node n to the exit node must go through d


    (12)

    Immediate Post Dominator

    • Node d immediately post-dominates node n if every path from node n to the exit node must go through d, and no other node that post-dominates n lies between d and n

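The four definitions above can be checked with a small iterative dataflow computation. The sketch below assumes illustrative edges for the A–G graph (A→B, B→C and B→D, C→E, D→E, E→F, F→G; the exact slide figure may differ) and recovers E as the immediate post-dominator of the branch at B, i.e., the reconvergence point used later.

```python
# Sketch: post-dominator sets via iteration to a fixed point, then the
# immediate post-dominator as the strict post-dominator d whose own
# post-dominator set equals the strict set of n.
def postdominators(succs, exit_node):
    nodes = set(succs) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}   # start from "everything"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succs[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def ipdom(pdom, n):
    strict = pdom[n] - {n}
    for d in strict:
        if pdom[d] == strict:
            return d
    return None

# Assumed edges for the A-G example CFG.
succs = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'],
         'E': ['F'], 'F': ['G'], 'G': []}
pd = postdominators(succs, 'G')
print(ipdom(pd, 'B'))   # 'E' -- the reconvergence point for the branch at B
```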


    Baseline: PDOM

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: CFG blocks annotated with active masks — A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 — executed by a warp of four threads with a common PC. The reconvergence stack holds entries of (Reconv. PC, Next PC, Active Mask); after the divergent branch at B the stack contains (–, E, 1111), (E, D, 0110), and TOS (E, C, 1001), and over time execution proceeds A, B, C, D, E, G as entries reach their reconvergence PC and pop.]

    (14)

    The Stack-Based Structure

    • A stack entry is a specification of a group of active threads that will execute that basic block

    • The natural nested structure of control exposes the use of stack-based serialization

    [Figure: the same CFG with active masks (A/1111, B/1111, C/1001, D/0110, E/1111, G/1111) and the corresponding stack of (Reconv. PC, Next PC, Active Mask) entries, illustrating how the nested structure of control maps onto stack-based serialization]
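The stack transitions in this example can be written as a tiny simulator (an illustrative encoding, not the hardware's): each entry holds (reconvergence PC, next PC, active mask); the divergent branch at B pushes an entry per target with RPC set to the IPDOM (E), and an entry pops when its next PC reaches its RPC.

```python
# Sketch of PDOM stack-based reconvergence for the slides' example:
# a warp of 4 threads runs A -> B, diverges to C (mask 1001) and D (mask 0110),
# reconverges at E (the IPDOM of B), then continues to G.
def simulate():
    flow = {'A': 'B', 'C': 'E', 'D': 'E', 'E': 'G', 'G': None}  # fall-through PCs
    branch_pc, rpc_of_branch = 'B', 'E'
    splits = [('D', 0b0110), ('C', 0b1001)]     # C pushed last -> executes first

    stack = [(None, 'A', 0b1111)]               # (reconv PC, next PC, mask)
    trace = []
    while stack:
        rpc, pc, mask = stack[-1]
        if pc == rpc:                           # reached reconvergence point
            stack.pop()
            continue
        trace.append((pc, mask))
        if pc == branch_pc:                     # divergence: push both splits
            stack.pop()
            stack.append((None, rpc_of_branch, 0b1111))  # reconverged continuation
            for target, m in splits:
                stack.append((rpc_of_branch, target, m))
        elif flow[pc] is None:                  # kernel exit
            stack.pop()
        else:
            stack[-1] = (rpc, flow[pc], mask)   # advance this split's PC
    return trace

print(simulate())
# [('A', 15), ('B', 15), ('C', 9), ('D', 6), ('E', 15), ('G', 15)]
```

The trace reproduces the slide: C runs under mask 1001, D under 0110, and the full mask 1111 is restored at E.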


    (15)

    More Complex Example

    From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

    • Stack-based implementation for nested control flow

    v Stack entry RPC set to the IPDOM

    • Re-convergence at the immediate post-dominator of the branch

    (16)

    Implementation

    [Figure: SIMT core pipeline — I-Fetch, I-Buffer (pending warps), Decode, Issue, RF/PRF, scalar pipelines, D-Cache, Writeback]

    From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

    • GPGPU-Sim model:

    v Implement the per-warp stack at the issue stage

    v Acquire the active mask and PC from the TOS

    v Scoreboard check prior to issue

    v Register writeback updates the scoreboard and the ready bit in the instruction buffer

    v When RPC = Next PC, pop the stack

    • Implications for instruction fetch?


    (17)

    Implementation (2)

    • warpPC (next instruction) compared to reconvergence PC

    • On a branch

    v Can store the reconvergence PC as part of the branch instruction

    v Branch unit has NextPC, TargetPC, and reconvergence PC to update the stack

    • On reaching a reconvergence point

    v Pop the stack

    v Continue fetching from the NextPC of the next entry on the stack

    (18)

    Stack Based Reconvergence: Summary

    • Compatible with the programming model

    • Interactions with the scheduler

    v Uniform progress → efficiency issue

    • Conservative

    v Can we do better, i.e., earlier reconvergence?

    • Issues?

    v Porting MIMD models

    v Interactions between scheduler (forward progress) and programming model (bulk synchronous execution)


    (19)

    Revisiting SIMT vs. MIMD (1)

    • Properties of current MIMD implementations

    v Independent progress of a thread

    v Progress expectations of the scheduler → fairness

    v Producer-consumer dependencies between threads

    • In general, scheduling orders alone cannot guarantee forward progress

    [Figure: loosely synchronized threads (e.g., pthreads) with per-thread register files vs. SIMT execution (e.g., PTX, HSA)]

    (20)

    Revisiting SIMT vs. MIMD (2)

    • Properties of current SIMT implementations

    v Serialization of divergent thread execution

    v Re-convergence at the earliest point (e.g., @IPDOM)

    v Fairness: progress expectations of the scheduler

    • Independent thread progress assumption does not apply

    v Blocking at the reconvergence point


    (21)

    SIMT Deadlock

    • Successful thread blocked at the reconvergence point

    • Waiting threads blocked on lock release → deadlock

    • Need progress on divergent threads to continue execution

    v Fairness and progress demands on the warp scheduler

    • Violation of (expected) programming model behaviors?

    Figure from T. Aamodt, W. Fung, and T. Rogers, “General Purpose Graphics Processor Architectures,” Synthesis Lectures on Computer Architecture, Draft
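The deadlock can be made concrete with a toy model (purely illustrative, not real hardware): thread 0 acquires a spin lock and then blocks at the reconvergence point, where the stack makes it wait for the rest of the warp; the spinning split at the TOS is the only split allowed to run, and it can never finish because the lock is never released.

```python
# Toy model of SIMT deadlock with a spin lock under stack-based reconvergence.
# Thread 0 holds the lock and waits at the reconvergence point; threads 1-3
# spin on the lock. Stack discipline only lets the TOS split (the spinners)
# execute, so the release code below the reconvergence point never runs.
def simt_spinlock(max_steps=1000):
    lock_held = True                  # acquired by thread 0 before divergence
    for _ in range(max_steps):
        # Only the TOS split (the spinners) is scheduled.
        if lock_held:
            continue                  # spinners loop; they never reach the RPC
        return 'no deadlock'          # (unreachable) spinners would pop here,
                                      # letting thread 0 resume and release
    return 'deadlock'

print(simt_spinlock())   # 'deadlock'
```

With independent thread progress (the MIMD assumption of the previous slides), the holder could be scheduled, release the lock, and the spinners would exit; the stack forbids exactly that interleaving.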

    (22)

    Goals

    • Be able to use MIMD semantics → use existing algorithmic idioms

    v Maintain correctness

    • Porting of existing multithreaded applications to GPUs

    • Avoid assumptions about compiler optimizations or schedulers


    (23)

    Convergence Barriers: Basic Idea

    [Figure: barrier hardware structures — Barrier 0, Barrier 1, …, Barrier n-1]

    • Allocate a named barrier

    • Add threads to the barrier

    • Split the warp → two warps

    • Warp splits reconverge at the named barrier

    (24)

    Convergence Barriers: Barrier Behavior

    • Richer barrier semantics

    • Thread states

    • Complex conditions for clearing a barrier

    • Target for compiler analysis

    • Target for warp schedulers



    (25)

    Tracking Data Structures

    • Barrier Participation Mask

    • Barrier State

    • Thread State: Ready, Yield, Blocked, Exited, …

    • Thread Active (only READY threads)

    • Thread PC

    (26)

    Convergence Barrier: Example -1 (1)

    • ADD instruction inserted by the compiler to update the barrier participation mask

    • Named barrier, B0, is allocated

    [Figure: per-warp tracking fields — warp size, barrier participation mask, barrier state, PC, thread active, thread state]

    Figure from patent US 2016/0019066A1


    (27)

    Convergence Barrier: Example - 1 (2)

    Warp split → warp 0 and warp 1; the scheduler can schedule either warp independently

    [Figure legend: active thread, inactive thread]

    Figure from patent US 2016/0019066A1

    (28)

    Convergence Barrier: Example – 1 (3)


    • Warp 0 arrives at barrier B0

    • Executes WAIT instruction that changes thread state of active threads to blocked

    • Warp 0 waits


    Figure from patent US 2016/0019066A1


    (29)

    Convergence Barrier: Example – 1 (4)

    • Warp 1 arrives at B0 and executes WAIT instruction

    • Check participation mask

    • Update barrier structure

    • Release the barrier


    Figure from patent US 2016/0019066A1
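The ADD/WAIT sequence of the example can be sketched as follows. The field names follow the slides; the release rule is a simplification of the patent's mechanism (no yield or opt-out states yet — those appear in the later slides).

```python
# Sketch of convergence-barrier bookkeeping for the two-split example.
class ConvergenceBarrier:
    def __init__(self):
        self.participation_mask = 0   # set by the compiler-inserted ADD
        self.barrier_state = 0        # threads currently blocked at the barrier

    def ADD(self, thread_mask):
        # ADD instruction: enroll threads in this named barrier (e.g., B0).
        self.participation_mask |= thread_mask

    def WAIT(self, arriving_mask):
        # WAIT instruction: arriving threads change state to blocked.
        self.barrier_state |= arriving_mask
        if self.barrier_state == self.participation_mask:
            released = self.barrier_state       # all participants have arrived
            self.barrier_state = 0
            self.participation_mask = 0         # barrier is released and freed
            return released                     # released threads become READY
        return 0                                # keep waiting

b0 = ConvergenceBarrier()
b0.ADD(0b1111)              # compiler ADD: all four threads participate
print(b0.WAIT(0b1001))      # warp 0 arrives and blocks -> 0
print(b0.WAIT(0b0110))      # warp 1 arrives -> barrier releases 0b1111 -> 15
```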

    (30)

    Nested Control Flow (1)

    Add threads to barrier B0

    Figure from patent US 2016/0019066A1


    (31)

    Nested Control Flow (2)

    Warp split

    Figure from patent US 2016/0019066A1

    (32)

    Nested Control Flow (3)

    Add 4 active threads to barrier B1

    Figure from patent US 2016/0019066A1


    (33)

    Nested Control Flow (4)

    Warp split

    Figure from patent US 2016/0019066A1

    (34)

    Nested Control Flow (5)

    Reconverge via wait

    Figure from patent US 2016/0019066A1


    (35)

    Nested Control Flow (6)

    Wait

    Figure from patent US 2016/0019066A1

    (36)

    Nested Control Flow (7)

    All threads arrive

    Figure from patent US 2016/0019066A1


    (37)

    Nested Control Flow (8)

    Reconverge

    Figure from patent US 2016/0019066A1

    (38)

    Iteration (1)

    Warp split

    • Compiler inserted YIELD instructions

    • Threads are suspended and do not participate in the barrier at B1

    Figure from patent US 2016/0019066A1



    (39)

    Iteration (2)

    Warp split

    • Compiler inserted YIELD instructions

    • Threads are suspended and do not participate in the barrier at B1

    Figure from patent US 2016/0019066A1


    (40)

    Iteration (3)

    • Barrier clears with the active threads in warp 1

    • Ignore threads in the yield state

    • Threads that have executed YIELD are placed in the yield state and their execution is suspended

    • The barrier may clear without the yielded threads (a missed reconvergence opportunity)

    • Scheduler/compiler optimizations to minimize missed reconvergence opportunities


    Figure from patent US 2016/0019066A1



    (41)

    Iteration (4)

    • Compiler responsibility to ensure divergent paths do not execute indefinitely

    • Periodically yield to the scheduler

    • Compiler-scheduler cooperation

    • Heuristics for insertion of YIELD instructions vs. conditions, e.g., timeouts

    Figure from patent US 2016/0019066A1

    (42)

    Early Reconvergence

    • Early exit

    • Threads taking this path opt out of the barrier → Opt-Out instructions

    • Opted-out threads: thread state = “EXITED”

    • Early reconvergence for the remaining threads

    Figure from patent US 2016/0019066A1


    (43)

    Behavior at the Barrier

    [Flowchart: on arrival at the barrier — remove opted-out threads (thread state is “EXITED”) and ignore yielded threads. All remaining threads here? If yes, clear the barrier; if no, return to the scheduler. If only yielded threads remain, clear the YIELD states and clear the barrier.]

    (44)

    Non-Stack Based Reconvergence

    • Close relationship between scheduler and compiler

    v To guarantee progress

    • Mechanisms for decoupling execution constraints (e.g., blocking/waiting @reconvergence) from dependencies

    v Rely on compiler or hardware (e.g., timers) to decouple dependencies

    v Performance cost via missed reconvergence opportunities

    • Side effect is more scheduling flexibility


    (45)

    Can We Do Better?

    • Warps are formed statically

    • Key idea of dynamic warp formation

    v Find a pool of warps → how can they be merged?

    • At a high level, what are the requirements?


    (46)

    Compaction Techniques

    • Can we reform warps so as to increase utilization?

    • Basic idea: compaction

    v Reform warps with threads that follow the same control flow path

    v Increase utilization of warps

    • Two basic types of compaction techniques

    • Inter-warp compaction

    v Group threads from different warps

    • Intra-warp compaction

    v Group threads within a warp

    o Changing the effective warp size


    (47)

    Inter-Warp Thread Compaction: Dynamic Warp Formation

    Lectures Slides and Figures contributed from sources as noted

    (48)

    Goal

    [Figure: six warps (Warp 0 – Warp 5) executing if (…) { … } else { … }; in each warp some threads take the branch and others do not — can threads from different warps be merged?]


    (49)

    Reading

    • W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009, Section 4.

    (50)

    Formation

    • How do we get from statically formed warps at kernel launch (recall the grid → warp mapping) to dynamic warps drawing threads from multiple warps?

    • Constraints imposed by the register file organization


    DWF: Example

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: two warps, x and y, execute blocks A–G with per-block active masks — A x/1111 y/1111; B x/1110 y/0011; C x/1000 y/0010; D x/0110 y/0001; E x/1110 y/0011; F x/0001 y/1100; G x/1111 y/1111. Baseline: each warp's splits issue separately over time. Dynamic warp formation: a new warp is created from scalar threads of both warp x and warp y executing at basic block D, shortening the schedule.]

    (52)

    How Does This Work?

    • Criteria for merging

    v Same PC

    v Complements of active threads in each warp

    v Recall: many warps/TB, all executing the same code

    • What information do we need to merge two warps?

    v Need thread IDs and PCs

    • Ideally, how would you find/merge warps?
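The merge criteria can be sketched directly: two warp splits merge when they are at the same PC and their active masks do not collide, and lane-aware merging keeps each thread in its home lane. The sketch below reproduces the X/Y → Z = 5 2 3 8 example that appears on the microarchitecture slides (mask bit 0 taken as the leftmost lane, an assumption of this encoding).

```python
# Sketch of the DWF merge test and lane-aware merge for 4-lane warps.
def can_merge(pc_a, mask_a, pc_b, mask_b):
    # Same PC, and no lane is active in both splits (complementary masks).
    return pc_a == pc_b and (mask_a & mask_b) == 0

def merge(tids_a, mask_a, tids_b, mask_b, lanes=4):
    # Per lane, take the thread from whichever split has that lane active;
    # a thread never moves lanes, so register-file banks are not disturbed.
    merged = []
    for lane in range(lanes):
        bit = 1 << (lanes - 1 - lane)     # lane 0 is the leftmost mask bit
        if mask_a & bit:
            merged.append(tids_a[lane])
        elif mask_b & bit:
            merged.append(tids_b[lane])
        else:
            merged.append(None)           # lane stays idle
    return merged

# Splits of warps X (threads 1-4) and Y (threads 5-8) both headed to block B.
print(can_merge('B', 0b0110, 'B', 0b1001))         # True
print(merge([1, 2, 3, 4], 0b0110, [5, 6, 7, 8], 0b1001))   # [5, 2, 3, 8]
```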


    DWF: Microarchitecture Implementation

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: DWF microarchitecture — I-Cache, Decode; Thread Scheduler with PC-Warp LUT (entries OCC, PC, IDX, hashed by PC), Warp Pool (entries TID × N, PC, Prio), Warp Allocator, Warp Update Registers, Issue Logic; per-lane register files RF 1–4 addressed by (TID, Reg#), ALUs 1–4, Commit/Writeback. The PC-Warp LUT assists in aggregating threads: the occupancy field identifies occupied and available lanes, and IDX points to the warp being formed.]

    • Warps formed dynamically in the warp pool

    • After commit, check the PC-Warp LUT and either merge into, or allocate, a newly forming warp

    • Lane aware

    DWF: Microarchitecture Implementation (continued)

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Worked example on the same microarchitecture: warps X (threads 1 2 3 4) and Y (threads 5 6 7 8) execute A: BEQ R2, B. The threads branching to B — X's threads 2, 3 (mask 0110) and Y's threads 5, 8 (mask 1001) — occupy complementary lanes and are merged into a new warp Z (5 2 3 8) with no lane conflict; the remaining threads continue at C.]


    (55)

    Resource Usage

    • Ideally would like a small number of unique PCs in progress at a time → minimize overhead

    • Warp divergence will increase the number of unique PCs

    v Mitigate via warp scheduling

    • Scheduling policies

    v FIFO

    v Program counter – address variation as a measure of divergence

    v Majority/Minority – most common path vs. helping stragglers

    v Post-dominator (catch up)
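The Majority policy above can be sketched as picking the PC held by the most threads across the forming warps, so that stragglers on minority PCs get time to accumulate into fuller warps. This is an illustration of the selection rule only; real heuristics track more state.

```python
# Sketch of "Majority" warp scheduling: issue the PC with the most threads.
from collections import Counter

def majority_pick(warp_pool):
    # warp_pool: list of (pc, active_mask) entries of warps being formed.
    votes = Counter()
    for pc, mask in warp_pool:
        votes[pc] += bin(mask).count('1')      # one vote per active thread
    pc, _ = votes.most_common(1)[0]
    return pc

pool = [('B', 0b0110), ('C', 0b1001), ('B', 0b1000)]
print(majority_pick(pool))   # 'B': 3 threads at B vs. 2 at C
```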

    (56)

    Hardware Consequences (1)

    • Expose the implications that warps have in the base design

    v Implications for register file access → lane-aware DWF

    • Register bank conflicts

    From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009


    (57)

    Hardware Consequences (2)

    [Figure: two register-file organizations — lane aware vs. ignoring bank and lane assignments — with per-bank decoders, scalar pipelines, a crossbar (Xbar), (Warp ID, TID) register addressing, and writeback]

    (58)

    Relaxing Implications of Warps

    • Thread swizzling

    v Essentially remap work to threads so as to create more opportunities for DWF → requires deep understanding of algorithm behavior and data sets

    • Lane swizzling in hardware

    v Provide limited connectivity between register banks and lanes → avoiding full crossbars


    (59)

    Intra-Warp Thread Compaction: Cycle Compression

    Lectures Slides and Figures contributed from sources as noted

    (60)

    Goals

    • Improve utilization in divergent code via intra-warp compaction

    • Become familiar with the architecture of Intel’s Gen integrated general purpose GPU architecture


    (61)

    Integrated GPUs: Intel HD Graphics

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    (62)

    Integrated GPUs: Intel HD Graphics

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    Gen graphics processor

    • 32-byte bidirectional ring

    • Dedicated coherence signals

    • Coherent distributed cache, shared with the GPU

    • Operates as a memory-side cache

    • Shared physical memory


    (63)

    Inside the Gen9 EU

    [Figure: EU thread state — per-thread register state in the Architecture Register File (ARF)]

    • Up to 7 threads

    • 128 × 256-bit registers/thread (8-way SIMD)

    • Each thread executes a kernel

    v Threads may execute different kernels

    v Multi-instruction dispatch

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    (64)

    Operation (1)

    • Dispatch 4 instructions from 4 threads

    • Constraints between issue slots

    • Support both FP and integer operations

    • 4 × 32-bit FP operations, 8 × 16-bit FP operations, or 8 × 16-bit integer operations; MAD operations each cycle

    • 96 bytes/cycle read BW, 32 bytes/cycle write BW

    • Divergence/reconvergence management

    [Figure labels: SIMD-16, SIMD-8, SIMD-32 instructions; SIMD-4 instructions; transform]

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides


    (65)

    Operation (2)

    [Figure: an EU thread intermixes SIMD instructions of various lengths — SIMD-4, SIMD-8, SIMD-16]

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    (66)

    Mapping the BSP Model

    [Figure: Grid 1 with thread blocks (0,0), (0,1), (1,0), (1,1); Block (1,1) contains threads indexed (0,0,0) through (1,0,3)]

    • Map multiple threads to a SIMD instance executed by an EU thread

    • All threads in a TB or workgroup mapped to the same EU thread (shared memory access)

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides


    (67)

    Slice Organization

    From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    Flexible partitioning

    • SM

    • Data cache

    • Buffers for accelerators

    Shared memory

    • 64 Kbyte/slice

    • Not coherent with other structures

    (68)

    Coherent Memory Hierarchy

    From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    • Shared local memory is not coherent with this hierarchy


    (69)

    Baseline Microarchitecture Operation

    [Figure: baseline SIMT pipeline — I-Fetch, I-Buffer (pending), Decode, Issue, RF/PRF, scalar pipelines, D-Cache, Writeback]

    (70)

    Microarchitecture Operation

    [Figure: the same pipeline annotated with Gen9 specifics]

    • Per-thread pre-fetch

    • Per-thread decode: variable-width SIMD + execution masks

    • Per-thread scoreboard check

    • Thread arbitration, dual issue/2 cycles, compaction control

    • Operand fetch/swizzle: encode the swizzle in the RF access

    • Instruction execution happens in waves of 4-wide operations (note: variable-width SIMD instructions); a 16-way SIMD instruction is executed in 4 cycles (cycle 0 through cycle 3)


    (71)

    Divergence Assessment

    [Chart: SIMD efficiency for coherent applications vs. divergent applications]

    Average Lane Utilization (SIMD efficiency) = Average SIMD width / Physical SIMD width
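The ratio above can be computed from per-instruction execution masks. A small sketch (the mask values are made up; a 16-wide machine is assumed):

```python
# Sketch: average SIMD width and lane utilization over a trace of
# per-instruction execution masks.
def lane_utilization(exec_masks, simd_width=16):
    active = [bin(m).count('1') for m in exec_masks]   # active lanes per inst.
    avg_width = sum(active) / len(active)
    return avg_width, avg_width / simd_width           # width, efficiency

masks = [0xFFFF, 0xF0F0, 0x000F]        # one coherent, two divergent
avg, util = lane_utilization(masks)
print(avg, util)    # (16 + 8 + 4) / 3 active lanes on average
```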

    (72)

    Basic Cycle Compression (1)

    • Compress cycles by suppressing dispatch

    v Operand fetch

    v Instruction execution

    • Actual operation depends on data types, execution cycles/op

    • Complex MRF access patterns

    • Note power/energy savings


    (73)

    Basic Cycle Compression (2)

    Cycle compression applies to:

    - Dispatch (complex MRF accesses can take multiple cycles)

    - Predication

    - Other cases with idle cycles

    [Figure: the execution mask is formed from the instruction predicate mask, the dispatch mask, and the divergence state]

    (74)

    Basic Cycle Compression: MRF Access

    [Figure: baseline register file organization for BCC]


    (75)

    Swizzle Cycle Compression

    • Use a more flexible thread → lane assignment

    • Operand data flow?

    • Compact threads to create fully idle cycles in the pipeline (which can then be compressed)

    (76)

    Register File Access for BCC

    [Figure: register file access for BCC — register halves R8.H0 and R8.H1, 32-bit operands]


    (77)

    32-bit Operand Mapping for SIMD16

    ADD (16) R12 R8 R10 [EXEC MASK 0xFFFF]

    translates to quad instructions:

    ADD.Q0 (4) R12.H0 R8.H0 R10.H0
    ADD.Q1 (4) R12.H1 R8.H1 R10.H1
    ADD.Q2 (4) R13.H0 R9.H0 R11.H0
    ADD.Q3 (4) R13.H1 R9.H1 R11.H1

    • Implicitly use pairs of registers for all 32-bit operands

    • R8, R9 → 16 × 32 = 512 bits (each 256-bit register is split into halves H0 and H1)

    (78)

    Operand Flow (1)

    [Figure: operand flow for ADD.Q1 (4) R12.H1 R8.H1 R10.H1 — register pairs R8:R9, R10:R11, R12:R13 feed two 128-bit, 4-lane ALUs over 128-bit buses]


    (79)

    Operand Flow (2)

    [Figure: operand flow for ADD.Q1, continued — the H1 operand halves are steered onto the 128-bit buses]

    (80)

    Operand Flow (3)

    [Figure: operand flow for the full quad sequence ADD.Q0 through ADD.Q3 over the 128-bit buses and 4-lane ALUs]


    (81)

    Compressing Idle Cycles (1)

    ADD (16) R12 R8 R10 [EXEC MASK 0xF0F0]

    ADD.Q0 (4) R12.H0 R8.H0 R10.H0
    ADD.Q1 (4) R12.H1 R8.H1 R10.H1
    ADD.Q2 (4) R13.H0 R9.H0 R11.H0
    ADD.Q3 (4) R13.H1 R9.H1 R11.H1

    [Figure: register pairs R8:R9, R10:R11, R12:R13 with the H1 halves highlighted as the active operands in this instruction]

    (82)

    Compressing Idle Cycles(2)

    ADD (16) R12 R8 R10 [EXEC MASK 0xF0F0]

    ADD.Q0 (4) R12.H0 R8.H0 R10.H0   (suppressed)
    ADD.Q1 (4) R12.H1 R8.H1 R10.H1
    ADD.Q2 (4) R13.H0 R9.H0 R11.H0   (suppressed)
    ADD.Q3 (4) R13.H1 R9.H1 R11.H1

    • Quads with no active lanes are suppressed and their cycles compressed
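Basic cycle compression can be sketched as skipping any quad whose 4-bit slice of the execution mask is zero. With the slides' EXEC MASK 0xF0F0 (quad 0 covering mask bits 0–3, an assumption of this encoding), two of the four quad cycles are compressed:

```python
# Sketch of basic cycle compression: a SIMD-16 instruction executes as four
# 4-lane quads (Q0..Q3); quads whose slice of the execution mask is all
# zero are not dispatched, saving those cycles.
def bcc_cycles(exec_mask, simd_width=16, lanes=4):
    dispatched = []
    for q in range(simd_width // lanes):
        quad_mask = (exec_mask >> (q * lanes)) & ((1 << lanes) - 1)
        if quad_mask:                     # at least one active lane
            dispatched.append(q)
    return dispatched, len(dispatched)

quads, cycles = bcc_cycles(0xF0F0)
print(quads, cycles)   # [1, 3] 2  -> 2 cycles instead of 4
```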


    (83)

    Swizzle Cycle Compression

    • Use a more flexible thread → lane assignment

    • Operand data flow?

    • Compact threads to create fully idle cycles in the pipeline (which can then be compressed)

    (84)

    Distributing Data to Lanes (1)

    [Figure: four 32-bit operands from a 512-bit quad distributed to the lane inputs]

    • Restrict swizzle patterns

    v Does not support all possible compression patterns

    • Need fast computation of efficient swizzle settings


    (85)

    Distributing Data to Lanes (2)

    [Figure: 512-bit quad feeding a 4-lane ALU]

    • Each lane connected to 32 bits of the 128-bit bus

    • Each element of the quad is connected to 32 bits of a 128-bit segment

    (86)

    SCC Operation

    [Figure: RF for a single operand, read over cycles i through i+3; lane inputs are packed into a quad (128 bits, 4 lanes)]

    • Can compact across quads; swizzle settings overlapped with RF access

    • Note increased area, power/energy

    • Note the need for inverse permutations


    (87)

    Compaction Opportunities

    [Figure: SIMD-16 and SIMD-8 instructions with idle lanes; once the active threads are packed, no further compaction is possible]

    • For K active threads, what is the maximum cycle savings for SIMD-N instructions?
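One answer to the question above, assuming 4-lane execution and perfect compaction as in the preceding slides: a SIMD-N instruction takes N/4 cycles uncompressed, and ceil(K/4) cycles once the K active threads are packed.

```python
# Sketch: maximum cycle savings from compacting K active threads of a
# SIMD-N instruction on a 4-lane execution pipeline.
from math import ceil

def max_cycle_savings(K, N, lanes=4):
    baseline = N // lanes                       # cycles with no compaction
    compacted = ceil(K / lanes) if K > 0 else 0 # cycles after perfect packing
    return baseline - compacted

print(max_cycle_savings(5, 16))    # 4 - ceil(5/4) = 4 - 2 = 2
print(max_cycle_savings(16, 16))   # fully active: nothing to save -> 0
```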

    (88)

    Performance Savings

    • Difference between saving cycles and saving time

    v When is #cycles ≠ time?


    (89)

    Summary

    • Multi-cycle warp/SIMD/work_group execution

    • Optimize #cycles/warp by compressing idle cycles

    v Rearrange idle cycles via swizzling to create opportunity

    • Sensitivities to the memory interface speeds

    v Memory-bound applications may experience limited benefit

    (90)

    Intra-Warp Compaction


    • Scope limited to within a warp

    • Increasing scope means increasing warp size, explicitly or implicitly (treating multiple warps as a single warp)


    (91)

    Summary

    • Control flow divergence is a fundamental performance limiter for SIMT execution

    • Dynamic warp formation is one way to mitigate these effects

    v We will look at several others

    • Must balance a complex set of effects

    v Memory behaviors

    v Synchronization behaviors

    v Scheduler behaviors