Introduction to Control Divergence · 2020. 7. 31.


    (1)

    Introduction to Control Divergence

    Lectures Slides and Figures contributed from sources as noted

    (2)

    Objectives

    • Understand the occurrence of control divergence and the concept of thread reconvergence

    v Also described as branch divergence and thread divergence

    • Cover a basic thread reconvergence mechanism – stack-based reconvergence

    • Cover a non-stack based reconvergence mechanism – convergence barriers

    • Dynamic warp formation – inter-warp compaction

    v Increase lane utilization by merging warps

    • Dynamic warp formation – intra-warp compaction

    v Increase lane utilization by compacting threads in a warp

    (3)

    Reading

    • W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009.

    • T. Aamodt, W. Fung, and T. Rogers, “General Purpose Graphics Processor Architectures,” Draft Text, Sections 3.1.1, 3.1.2.

    • G. Diamos, R. Johnson, V. Grover, O. Giroux, J. Choquette, M. Fetterman, A. Tirumala, P. Nelson, R. Krashinsky, “Execution of Divergent Threads Using a Convergence Barrier,” U.S. Patent US 2016/0019066 A1, January 2016.

    • A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, M. Azimi, “SIMD Divergence Optimization Through Intra-Warp Compaction,” ISCA 2013.

    (4)

    Handling Branches

    • CUDA code: if (…) … (true for some threads)

    else … (true for others)

    • What if threads take different branches?

    • Branch Divergence!

    [Figure: four threads in a warp — some take the branch, others do not]
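The divergent if/else above can be mimicked in software (a Python sketch, not actual GPU code): the warp executes both sides of the branch serially, and a per-lane active mask selects which lanes commit on each pass. The values and the two branch bodies are made up for illustration.

```python
# Sketch: serialized execution of a divergent branch in a 4-thread warp.
# Each lane evaluates the condition; the warp then runs the taken path
# with its mask, then the not-taken path with the complementary mask.
def run_divergent_branch(values):
    taken_mask = [v > 0 for v in values]          # per-lane branch outcome
    not_taken_mask = [not t for t in taken_mask]
    result = [None] * len(values)
    # Pass 1: the "if" side; only lanes in taken_mask are active.
    for lane, active in enumerate(taken_mask):
        if active:
            result[lane] = values[lane] * 2       # body of if (illustrative)
    # Pass 2: the "else" side; the complementary lanes are active.
    for lane, active in enumerate(not_taken_mask):
        if active:
            result[lane] = -values[lane]          # body of else (illustrative)
    # Reconvergence: all lanes are active again.
    return result, taken_mask

res, mask = run_divergent_branch([3, -1, 4, -2])
print(res)    # [6, 1, 8, 2]
print(mask)   # [True, False, True, False]
```

Note that both passes run even when only one lane diverges, which is exactly the low-utilization problem the following slides address.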


    (5)

    Branch Divergence

    • Occurs within a warp

    • Branches lead to serialization of branch-dependent code

    v Performance issue: low warp utilization

    [Figure: if (…) { … } else { … } — idle threads on each path until reconvergence]

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: a warp of four threads sharing a common PC; control-flow graph with basic blocks A–G]

    • Different threads follow different control flow paths through the kernel code

    • Thread execution is (partially) serialized

    v Subsets of threads that follow the same path execute in parallel

    (7)

    Basic Idea

    • Split: partition a warp

    v Two mutually exclusive thread subsets, each branching to a different target

    v Identify subsets with two activity masks → effectively two warps

    • Join: merge two subsets of a previously split warp

    v Reconverge the mutually exclusive sets of threads

    • Orchestrate the correct execution for nested branches

    • Note the long history of techniques in SIMD processors (see background in Fung et al.)

    [Figure: thread warp with a common PC and activity mask 0011 identifying the active subset]

    (8)

    Thread Reconvergence

    • Fundamental problem:

    v Merge threads with the same PC

    v How do we sequence execution of threads? This can affect the ability to reconverge

    • Question: When can threads productively reconverge?

    • Question: When is the best time to reconverge?

    [CFG: blocks A–G — threads must be at the same PC to reconverge]


    (9)

    Dominator

    • Node d dominates node n if every path from the entry node to n must go through d


    (10)

    Immediate Dominator

    • Node d immediately dominates node n if every path from the entry node to n must go through d, and no other node that dominates n lies between d and n


    (11)

    Post Dominator

    • Node d post dominates node n if every path from the node n to the exit node must go through d


    (12)

    Immediate Post Dominator

    • Node d immediately post-dominates node n if every path from node n to the exit node must go through d, and no other node that post-dominates n lies between d and n

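The four definitions above can be checked with a small iterative dataflow computation. The sketch below assumes illustrative edges for the A–G graph (A→B, B→C and B→D, C→E, D→E, E→F, F→G; the exact slide figure may differ) and recovers E as the immediate post-dominator of the branch at B, i.e., the reconvergence point used later.

```python
# Sketch: post-dominator sets via iteration to a fixed point, then the
# immediate post-dominator as the strict post-dominator d whose own
# post-dominator set equals the strict set of n.
def postdominators(succs, exit_node):
    nodes = set(succs) | {exit_node}
    pdom = {n: set(nodes) for n in nodes}   # start from "everything"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            new = {n} | set.intersection(*(pdom[s] for s in succs[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom

def ipdom(pdom, n):
    strict = pdom[n] - {n}
    for d in strict:
        if pdom[d] == strict:
            return d
    return None

# Assumed edges for the A-G example CFG.
succs = {'A': ['B'], 'B': ['C', 'D'], 'C': ['E'], 'D': ['E'],
         'E': ['F'], 'F': ['G'], 'G': []}
pd = postdominators(succs, 'G')
print(ipdom(pd, 'B'))   # 'E' -- the reconvergence point for the branch at B
```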


    Baseline: PDOM

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: CFG blocks annotated with active masks — A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 — executed by a warp of four threads with a common PC. The reconvergence stack holds entries of (Reconv. PC, Next PC, Active Mask); after the divergent branch at B the stack contains (–, E, 1111), (E, D, 0110), and TOS (E, C, 1001), and over time execution proceeds A, B, C, D, E, G as entries reach their reconvergence PC and pop.]

    (14)

    The Stack-Based Structure

    • A stack entry is a specification of a group of active threads that will execute that basic block

    • The natural nested structure of control exposes the use of stack-based serialization

    [Figure: the same CFG with active masks (A/1111, B/1111, C/1001, D/0110, E/1111, G/1111) and the corresponding stack of (Reconv. PC, Next PC, Active Mask) entries, illustrating how the nested structure of control maps onto stack-based serialization]
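The stack transitions in this example can be written as a tiny simulator (an illustrative encoding, not the hardware's): each entry holds (reconvergence PC, next PC, active mask); the divergent branch at B pushes an entry per target with RPC set to the IPDOM (E), and an entry pops when its next PC reaches its RPC.

```python
# Sketch of PDOM stack-based reconvergence for the slides' example:
# a warp of 4 threads runs A -> B, diverges to C (mask 1001) and D (mask 0110),
# reconverges at E (the IPDOM of B), then continues to G.
def simulate():
    flow = {'A': 'B', 'C': 'E', 'D': 'E', 'E': 'G', 'G': None}  # fall-through PCs
    branch_pc, rpc_of_branch = 'B', 'E'
    splits = [('D', 0b0110), ('C', 0b1001)]     # C pushed last -> executes first

    stack = [(None, 'A', 0b1111)]               # (reconv PC, next PC, mask)
    trace = []
    while stack:
        rpc, pc, mask = stack[-1]
        if pc == rpc:                           # reached reconvergence point
            stack.pop()
            continue
        trace.append((pc, mask))
        if pc == branch_pc:                     # divergence: push both splits
            stack.pop()
            stack.append((None, rpc_of_branch, 0b1111))  # reconverged continuation
            for target, m in splits:
                stack.append((rpc_of_branch, target, m))
        elif flow[pc] is None:                  # kernel exit
            stack.pop()
        else:
            stack[-1] = (rpc, flow[pc], mask)   # advance this split's PC
    return trace

print(simulate())
# [('A', 15), ('B', 15), ('C', 9), ('D', 6), ('E', 15), ('G', 15)]
```

The trace reproduces the slide: C runs under mask 1001, D under 0110, and the full mask 1111 is restored at E.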


    (15)

    More Complex Example

    From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009

    • Stack-based implementation for nested control flow

    v Stack entry RPC set to the IPDOM

    • Re-convergence at the immediate post-dominator of the branch

    (16)

    Implementation

    [Figure: SIMT core pipeline — I-Fetch, I-Buffer (pending warps), Decode, Issue, RF/PRF, scalar pipelines, D-Cache, Writeback]

    From GPGPU-Sim Documentation http://gpgpu-sim.org/manual/index.php/GPGPU-Sim_3.x_Manual#SIMT_Cores

    • GPGPU-Sim model:

    v Implement the per-warp stack at the issue stage

    v Acquire the active mask and PC from the TOS

    v Scoreboard check prior to issue

    v Register writeback updates the scoreboard and the ready bit in the instruction buffer

    v When RPC = Next PC, pop the stack

    • Implications for instruction fetch?


    (17)

    Implementation (2)

    • warpPC (next instruction) compared to reconvergence PC

    • On a branch

    v Can store the reconvergence PC as part of the branch instruction

    v Branch unit has NextPC, TargetPC, and reconvergence PC to update the stack

    • On reaching a reconvergence point

    v Pop the stack

    v Continue fetching from the NextPC of the next entry on the stack

    (18)

    Stack Based Reconvergence: Summary

    • Compatible with the programming model

    • Interactions with the scheduler

    v Uniform progress → efficiency issue

    • Conservative

    v Can we do better, i.e., earlier reconvergence?

    • Issues?

    v Porting MIMD models

    v Interactions between scheduler (forward progress) and programming model (bulk synchronous execution)


    (19)

    Revisiting SIMT vs. MIMD (1)

    • Properties of current MIMD implementations

    v Independent progress of a thread

    v Progress expectations of the scheduler → fairness

    v Producer-consumer dependencies between threads

    • In general, scheduling orders alone cannot guarantee forward progress

    [Figure: loosely synchronized threads (e.g., pthreads) with per-thread register files vs. SIMT execution (e.g., PTX, HSA)]

    (20)

    Revisiting SIMT vs. MIMD (2)

    • Properties of current SIMT implementations

    v Serialization of divergent thread execution

    v Re-convergence at the earliest point (e.g., @IPDOM)

    v Fairness: progress expectations of the scheduler

    • Independent thread progress assumption does not apply

    v Blocking at the reconvergence point


    (21)

    SIMT Deadlock

    • Successful thread blocked at the reconvergence point

    • Waiting threads blocked on lock release → deadlock

    • Need progress on divergent threads to continue execution

    v Fairness and progress demands on the warp scheduler

    • Violation of (expected) programming model behaviors?

    Figure from T. Aamodt, W. Fung, and T. Rogers, “General Purpose Graphics Processor Architectures,” Synthesis Lectures on Computer Architecture, Draft
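The deadlock can be made concrete with a toy model (purely illustrative, not real hardware): thread 0 acquires a spin lock and then blocks at the reconvergence point, where the stack makes it wait for the rest of the warp; the spinning split at the TOS is the only split allowed to run, and it can never finish because the lock is never released.

```python
# Toy model of SIMT deadlock with a spin lock under stack-based reconvergence.
# Thread 0 holds the lock and waits at the reconvergence point; threads 1-3
# spin on the lock. Stack discipline only lets the TOS split (the spinners)
# execute, so the release code below the reconvergence point never runs.
def simt_spinlock(max_steps=1000):
    lock_held = True                  # acquired by thread 0 before divergence
    for _ in range(max_steps):
        # Only the TOS split (the spinners) is scheduled.
        if lock_held:
            continue                  # spinners loop; they never reach the RPC
        return 'no deadlock'          # (unreachable) spinners would pop here,
                                      # letting thread 0 resume and release
    return 'deadlock'

print(simt_spinlock())   # 'deadlock'
```

With independent thread progress (the MIMD assumption of the previous slides), the holder could be scheduled, release the lock, and the spinners would exit; the stack forbids exactly that interleaving.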

    (22)

    Goals

    • Be able to use MIMD semantics → use existing algorithmic idioms

    v Maintain correctness

    • Porting of existing multithreaded applications to GPUs

    • Avoid assumptions about compiler optimizations or schedulers


    (23)

    Convergence Barriers: Basic Idea

    [Figure: barrier hardware structures — Barrier 0, Barrier 1, …, Barrier n-1]

    • Allocate a named barrier

    • Add threads to the barrier

    • Split the warp → two warps

    • Warp splits reconverge at the named barrier

    (24)

    Convergence Barriers: Barrier Behavior

    • Richer barrier semantics

    • Thread states

    • Complex conditions for clearing a barrier

    • Target for compiler analysis

    • Target for warp schedulers



    (25)

    Tracking Data Structures

    • Barrier Participation Mask

    • Barrier State

    • Thread State: Ready, Yield, Blocked, Exited, …

    • Thread Active (only READY threads)

    • Thread PC

    (26)

    Convergence Barrier: Example -1 (1)

    • ADD instruction inserted by the compiler to update the barrier participation mask

    • Named barrier, B0, is allocated

    [Figure: per-warp tracking fields — warp size, barrier participation mask, barrier state, PC, thread active, thread state]

    Figure from patent US 2016/0019066A1


    (27)

    Convergence Barrier: Example - 1 (2)

    Warp split → warp 0 and warp 1; the scheduler can schedule either warp independently

    [Figure legend: active thread, inactive thread]

    Figure from patent US 2016/0019066A1

    (28)

    Convergence Barrier: Example – 1 (3)


    • Warp 0 arrives at barrier B0

    • Executes WAIT instruction that changes thread state of active threads to blocked

    • Warp 0 waits


    Figure from patent US 2016/0019066A1


    (29)

    Convergence Barrier: Example – 1 (4)

    • Warp 1 arrives at B0 and executes WAIT instruction

    • Check participation mask

    • Update barrier structure

    • Release the barrier


    Figure from patent US 2016/0019066A1
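The ADD/WAIT sequence of the example can be sketched as follows. The field names follow the slides; the release rule is a simplification of the patent's mechanism (no yield or opt-out states yet — those appear in the later slides).

```python
# Sketch of convergence-barrier bookkeeping for the two-split example.
class ConvergenceBarrier:
    def __init__(self):
        self.participation_mask = 0   # set by the compiler-inserted ADD
        self.barrier_state = 0        # threads currently blocked at the barrier

    def ADD(self, thread_mask):
        # ADD instruction: enroll threads in this named barrier (e.g., B0).
        self.participation_mask |= thread_mask

    def WAIT(self, arriving_mask):
        # WAIT instruction: arriving threads change state to blocked.
        self.barrier_state |= arriving_mask
        if self.barrier_state == self.participation_mask:
            released = self.barrier_state       # all participants have arrived
            self.barrier_state = 0
            self.participation_mask = 0         # barrier is released and freed
            return released                     # released threads become READY
        return 0                                # keep waiting

b0 = ConvergenceBarrier()
b0.ADD(0b1111)              # compiler ADD: all four threads participate
print(b0.WAIT(0b1001))      # warp 0 arrives and blocks -> 0
print(b0.WAIT(0b0110))      # warp 1 arrives -> barrier releases 0b1111 -> 15
```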

    (30)

    Nested Control Flow (1)

    Add threads to barrier B0

    Figure from patent US 2016/0019066A1


    (31)

    Nested Control Flow (2)

    Warp split

    Figure from patent US 2016/0019066A1

    (32)

    Nested Control Flow (3)

    Add 4 active threads to barrier B1

    Figure from patent US 2016/0019066A1


    (33)

    Nested Control Flow (4)

    Warp split

    Figure from patent US 2016/0019066A1

    (34)

    Nested Control Flow (5)

    Reconverge via wait

    Figure from patent US 2016/0019066A1


    (35)

    Nested Control Flow (6)

    Wait

    Figure from patent US 2016/0019066A1

    (36)

    Nested Control Flow (7)

    All threads arrive

    Figure from patent US 2016/0019066A1


    (37)

    Nested Control Flow (8)

    Reconverge

    Figure from patent US 2016/0019066A1

    (38)

    Iteration (1)

    Warp split

    • Compiler inserted YIELD instructions

    • Threads are suspended and do not participate in the barrier at B1

    Figure from patent US 2016/0019066A1



    (39)

    Iteration (2)

    Warp split

    • Compiler inserted YIELD instructions

    • Threads are suspended and do not participate in the barrier at B1

    Figure from patent US 2016/0019066A1


    (40)

    Iteration (3)

    • Barrier clears with the active threads in warp 1

    • Ignore threads in the yield state

    • Threads that have executed YIELD are placed in the yield state and their execution is suspended

    • The barrier may clear without the yielded threads (a missed reconvergence opportunity)

    • Scheduler/compiler optimizations to minimize missed reconvergence opportunities


    Figure from patent US 2016/0019066A1



    (41)

    Iteration (4)

    • Compiler responsibility to ensure divergent paths do not execute indefinitely

    • Periodically yield to the scheduler

    • Compiler-scheduler cooperation

    • Heuristics for insertion of YIELD instructions vs. conditions, e.g., timeouts

    Figure from patent US 2016/0019066A1

    (42)

    Early Reconvergence

    • Early exit

    • Threads taking this path opt out of the barrier → Opt-Out instructions

    • Opted-out threads: thread state = “EXITED”

    • Early reconvergence for the remaining threads

    Figure from patent US 2016/0019066A1


    (43)

    Behavior at the Barrier

    [Flowchart: on arrival at the barrier — remove opted-out threads (thread state is “EXITED”) and ignore yielded threads. All remaining threads here? If yes, clear the barrier; if no, return to the scheduler. If only yielded threads remain, clear the YIELD states and clear the barrier.]

    (44)

    Non-Stack Based Reconvergence

    • Close relationship between scheduler and compiler

    v To guarantee progress

    • Mechanisms for decoupling execution constraints (e.g., blocking/waiting @reconvergence) from dependencies

    v Rely on compiler or hardware (e.g., timers) to decouple dependencies

    v Performance cost via missed reconvergence opportunities

    • Side effect is more scheduling flexibility


    (45)

    Can We Do Better?

    • Warps are formed statically

    • Key idea of dynamic warp formation

    v Find a pool of warps → how can they be merged?

    • At a high level, what are the requirements?


    (46)

    Compaction Techniques

    • Can we reform warps so as to increase utilization?

    • Basic idea: compaction

    v Reform warps with threads that follow the same control flow path

    v Increase utilization of warps

    • Two basic types of compaction techniques

    • Inter-warp compaction

    v Group threads from different warps

    • Intra-warp compaction

    v Group threads within a warp

    o Changing the effective warp size


    (47)

    Inter-Warp Thread Compaction: Dynamic Warp Formation

    Lectures Slides and Figures contributed from sources as noted

    (48)

    Goal

    [Figure: six warps (Warp 0 – Warp 5) executing if (…) { … } else { … }; in each warp some threads take the branch and others do not — can threads from different warps be merged?]


    (49)

    Reading

    • W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009, Section 4.

    (50)

    Formation

    • How do we get from statically formed warps at kernel launch (recall the grid → warp mapping) to dynamic warps drawing threads from multiple warps?

    • Constraints imposed by the register file organization


    DWF: Example

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: two warps, x and y, execute blocks A–G with per-block active masks — A x/1111 y/1111; B x/1110 y/0011; C x/1000 y/0010; D x/0110 y/0001; E x/1110 y/0011; F x/0001 y/1100; G x/1111 y/1111. Baseline: each warp's splits issue separately over time. Dynamic warp formation: a new warp is created from scalar threads of both warp x and warp y executing at basic block D, shortening the schedule.]

    (52)

    How Does This Work?

    • Criteria for merging

    v Same PC

    v Complements of active threads in each warp

    v Recall: many warps/TB, all executing the same code

    • What information do we need to merge two warps?

    v Need thread IDs and PCs

    • Ideally, how would you find/merge warps?
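The merge criteria can be sketched directly: two warp splits merge when they are at the same PC and their active masks do not collide, and lane-aware merging keeps each thread in its home lane. The sketch below reproduces the X/Y → Z = 5 2 3 8 example that appears on the microarchitecture slides (mask bit 0 taken as the leftmost lane, an assumption of this encoding).

```python
# Sketch of the DWF merge test and lane-aware merge for 4-lane warps.
def can_merge(pc_a, mask_a, pc_b, mask_b):
    # Same PC, and no lane is active in both splits (complementary masks).
    return pc_a == pc_b and (mask_a & mask_b) == 0

def merge(tids_a, mask_a, tids_b, mask_b, lanes=4):
    # Per lane, take the thread from whichever split has that lane active;
    # a thread never moves lanes, so register-file banks are not disturbed.
    merged = []
    for lane in range(lanes):
        bit = 1 << (lanes - 1 - lane)     # lane 0 is the leftmost mask bit
        if mask_a & bit:
            merged.append(tids_a[lane])
        elif mask_b & bit:
            merged.append(tids_b[lane])
        else:
            merged.append(None)           # lane stays idle
    return merged

# Splits of warps X (threads 1-4) and Y (threads 5-8) both headed to block B.
print(can_merge('B', 0b0110, 'B', 0b1001))         # True
print(merge([1, 2, 3, 4], 0b0110, [5, 6, 7, 8], 0b1001))   # [5, 2, 3, 8]
```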


    DWF: Microarchitecture Implementation

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Figure: DWF microarchitecture — I-Cache, Decode; Thread Scheduler with PC-Warp LUT (entries OCC, PC, IDX, hashed by PC), Warp Pool (entries TID × N, PC, Prio), Warp Allocator, Warp Update Registers, Issue Logic; per-lane register files RF 1–4 addressed by (TID, Reg#), ALUs 1–4, Commit/Writeback. The PC-Warp LUT assists in aggregating threads: the occupancy field identifies occupied and available lanes, and IDX points to the warp being formed.]

    • Warps formed dynamically in the warp pool

    • After commit, check the PC-Warp LUT and either merge into, or allocate, a newly forming warp

    • Lane aware

    DWF: Microarchitecture Implementation (continued)

    Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt

    [Worked example on the same microarchitecture: warps X (threads 1 2 3 4) and Y (threads 5 6 7 8) execute A: BEQ R2, B. The threads branching to B — X's threads 2, 3 (mask 0110) and Y's threads 5, 8 (mask 1001) — occupy complementary lanes and are merged into a new warp Z (5 2 3 8) with no lane conflict; the remaining threads continue at C.]


    (55)

    Resource Usage

    • Ideally would like a small number of unique PCs in progress at a time → minimize overhead

    • Warp divergence will increase the number of unique PCs

    v Mitigate via warp scheduling

    • Scheduling policies

    v FIFO

    v Program counter – address variation as a measure of divergence

    v Majority/Minority – most common path vs. helping stragglers

    v Post-dominator (catch up)
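The Majority policy above can be sketched as picking the PC held by the most threads across the forming warps, so that stragglers on minority PCs get time to accumulate into fuller warps. This is an illustration of the selection rule only; real heuristics track more state.

```python
# Sketch of "Majority" warp scheduling: issue the PC with the most threads.
from collections import Counter

def majority_pick(warp_pool):
    # warp_pool: list of (pc, active_mask) entries of warps being formed.
    votes = Counter()
    for pc, mask in warp_pool:
        votes[pc] += bin(mask).count('1')      # one vote per active thread
    pc, _ = votes.most_common(1)[0]
    return pc

pool = [('B', 0b0110), ('C', 0b1001), ('B', 0b1000)]
print(majority_pick(pool))   # 'B': 3 threads at B vs. 2 at C
```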

    (56)

    Hardware Consequences (1)

    • Expose the implications that warps have in the base design

    v Implications for register file access → lane-aware DWF

    • Register bank conflicts

    From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009


    (57)

    Hardware Consequences (2)

    [Figure: two register-file organizations — lane aware vs. ignoring bank and lane assignments — with per-bank decoders, scalar pipelines, a crossbar (Xbar), (Warp ID, TID) register addressing, and writeback]

    (58)

    Relaxing Implications of Warps

    • Thread swizzling

    v Essentially remap work to threads so as to create more opportunities for DWF → requires deep understanding of algorithm behavior and data sets

    • Lane swizzling in hardware

    v Provide limited connectivity between register banks and lanes → avoiding full crossbars


    (59)

    Intra-Warp Thread Compaction: Cycle Compression

    Lectures Slides and Figures contributed from sources as noted

    (60)

    Goals

    • Improve utilization in divergent code via intra-warp compaction

    • Become familiar with the architecture of Intel’s Gen integrated general purpose GPU architecture


    (61)

    Integrated GPUs: Intel HD Graphics

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    (62)

    Integrated GPUs: Intel HD Graphics

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    Gen graphics processor

    • 32-byte bidirectional ring

    • Dedicated coherence signals

    • Coherent distributed cache, shared with the GPU

    • Operates as a memory-side cache

    • Shared physical memory


    (63)

    Inside the Gen9 EU

    [Figure: EU thread state — per-thread register state in the Architecture Register File (ARF)]

    • Up to 7 threads

    • 128 × 256-bit registers/thread (8-way SIMD)

    • Each thread executes a kernel

    v Threads may execute different kernels

    v Multi-instruction dispatch

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    (64)

    Operation (1)

    • Dispatch 4 instructions from 4 threads

    • Constraints between issue slots

    • Support both FP and integer operations

    • 4 × 32-bit FP operations, 8 × 16-bit FP operations, or 8 × 16-bit integer operations; MAD operations each cycle

    • 96 bytes/cycle read BW, 32 bytes/cycle write BW

    • Divergence/reconvergence management

    [Figure labels: SIMD-16, SIMD-8, SIMD-32 instructions; SIMD-4 instructions; transform]

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides


    (65)

    Operation (2)

    [Figure: an EU thread intermixes SIMD instructions of various lengths — SIMD-4, SIMD-8, SIMD-16]

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    (66)

    Mapping the BSP Model

    [Figure: Grid 1 with thread blocks (0,0), (0,1), (1,0), (1,1); Block (1,1) contains threads indexed (0,0,0) through (1,0,3)]

    • Map multiple threads to a SIMD instance executed by an EU thread

    • All threads in a TB or workgroup mapped to the same EU thread (shared memory access)

    Figure from The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides


    (67)

    Slice Organization

    From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    Flexible partitioning

    • SM

    • Data cache

    • Buffers for accelerators

    Shared memory

    • 64 Kbyte/slice

    • Not coherent with other structures

    (68)

    Coherent Memory Hierarchy

    From The Computer Architecture of the Intel Processor Graphics Gen9, https://software.intel.com/en-us/articles/intel-graphics-developers-guides

    • Shared local memory is not coherent with this hierarchy


    (69)

    Baseline Microarchitecture Operation

    [Figure: baseline SIMT pipeline — I-Fetch, I-Buffer (pending), Decode, Issue, RF/PRF, scalar pipelines, D-Cache, Writeback]

    (70)

    Microarchitecture Operation

    [Figure: the same pipeline annotated with Gen9 specifics]

    • Per-thread pre-fetch

    • Per-thread decode: variable-width SIMD + execution masks

    • Per-thread scoreboard check

    • Thread arbitration, dual issue/2 cycles, compaction control

    • Operand fetch/swizzle: encode the swizzle in the RF access

    • Instruction execution happens in waves of 4-wide operations (note: variable-width SIMD instructions); a 16-way SIMD instruction is executed in 4 cycles (cycle 0 through cycle 3)


    (71)

    Divergence Assessment

    [Chart: SIMD efficiency for coherent applications vs. divergent applications]

    Average Lane Utilization (SIMD efficiency) = Average SIMD width / Physical SIMD width
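The ratio above can be computed from per-instruction execution masks. A small sketch (the mask values are made up; a 16-wide machine is assumed):

```python
# Sketch: average SIMD width and lane utilization over a trace of
# per-instruction execution masks.
def lane_utilization(exec_masks, simd_width=16):
    active = [bin(m).count('1') for m in exec_masks]   # active lanes per inst.
    avg_width = sum(active) / len(active)
    return avg_width, avg_width / simd_width           # width, efficiency

masks = [0xFFFF, 0xF0F0, 0x000F]        # one coherent, two divergent
avg, util = lane_utilization(masks)
print(avg, util)    # (16 + 8 + 4) / 3 active lanes on average
```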

    (72)

    Basic Cycle Compression (1)

    • Compress cycles by suppressing dispatch

    v Operand fetch

    v Instruction execution

    • Actual operation depends on data types, execution cycles/op

    • Complex MRF access patterns

    • Note power/energy savings


    (73)

    Basic Cycle Compression (2)

    Cycle compression applies to:

    - Dispatch (complex MRF accesses can take multiple cycles)

    - Predication

    - Other cases with idle cycles

    [Figure: the execution mask is formed from the instruction predicate mask, the dispatch mask, and the divergence state]

    (74)

    Basic Cycle Compression: MRF Access

    [Figure: baseline register file organization for BCC]


    (75)

    Swizzle Cycle Compression

    • Use a more flexible thread → lane assignment

    • Operand data flow?

    • Compact threads to create fully idle cycles in the pipeline (which can then be compressed)

    (76)

    Register File Access for BCC

    [Figure: register file access for BCC — register halves R8.H0 and R8.H1, 32-bit operands]


    (77)

    32-bit Operand Mapping for SIMD16

    ADD (16) R12 R8 R10 [EXEC MASK 0xFFFF]

    translates to quad instructions:

    ADD.Q0 (4) R12.H0 R8.H0 R10.H0
    ADD.Q1 (4) R12.H1 R8.H1 R10.H1
    ADD.Q2 (4) R13.H0 R9.H0 R11.H0
    ADD.Q3 (4) R13.H1 R9.H1 R11.H1

    • Implicitly use pairs of registers for all 32-bit operands

    • R8, R9 → 16 × 32 = 512 bits (each 256-bit register is split into halves H0 and H1)

    (78)

    Operand Flow (1)

    [Figure: operand flow for ADD.Q1 (4) R12.H1 R8.H1 R10.H1 — register pairs R8:R9, R10:R11, R12:R13 feed two 128-bit, 4-lane ALUs over 128-bit buses]


    (79)

    Operand Flow (2)

    [Figure: operand flow for ADD.Q1, continued — the H1 operand halves are steered onto the 128-bit buses]

    (80)

    Operand Flow (3)

    [Figure: operand flow for the full quad sequence ADD.Q0 through ADD.Q3 over the 128-bit buses and 4-lane ALUs]


    (81)

    Compressing Idle Cycles (1)

    ADD (16) R12 R8 R10 [EXEC MASK 0xF0F0]

    ADD.Q0 (4) R12.H0 R8.H0 R10.H0
    ADD.Q1 (4) R12.H1 R8.H1 R10.H1
    ADD.Q2 (4) R13.H0 R9.H0 R11.H0
    ADD.Q3 (4) R13.H1 R9.H1 R11.H1

    [Figure: register pairs R8:R9, R10:R11, R12:R13 with the H1 halves highlighted as the active operands in this instruction]

    (82)

    Compressing Idle Cycles(2)

    ADD (16) R12 R8 R10 [EXEC MASK 0xF0F0]

    ADD.Q0 (4) R12.H0 R8.H0 R10.H0   (suppressed)
    ADD.Q1 (4) R12.H1 R8.H1 R10.H1
    ADD.Q2 (4) R13.H0 R9.H0 R11.H0   (suppressed)
    ADD.Q3 (4) R13.H1 R9.H1 R11.H1

    • Quads with no active lanes are suppressed and their cycles compressed
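Basic cycle compression can be sketched as skipping any quad whose 4-bit slice of the execution mask is zero. With the slides' EXEC MASK 0xF0F0 (quad 0 covering mask bits 0–3, an assumption of this encoding), two of the four quad cycles are compressed:

```python
# Sketch of basic cycle compression: a SIMD-16 instruction executes as four
# 4-lane quads (Q0..Q3); quads whose slice of the execution mask is all
# zero are not dispatched, saving those cycles.
def bcc_cycles(exec_mask, simd_width=16, lanes=4):
    dispatched = []
    for q in range(simd_width // lanes):
        quad_mask = (exec_mask >> (q * lanes)) & ((1 << lanes) - 1)
        if quad_mask:                     # at least one active lane
            dispatched.append(q)
    return dispatched, len(dispatched)

quads, cycles = bcc_cycles(0xF0F0)
print(quads, cycles)   # [1, 3] 2  -> 2 cycles instead of 4
```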


    (83)

    Swizzle Cycle Compression

    • Use a more flexible thread → lane assignment

    • Operand data flow?

    • Compact threads to create fully idle cycles in the pipeline (which can then be compressed)

    (84)

    Distributing Data to Lanes (1)

    [Figure: four 32-bit operands from a 512-bit quad distributed to the lane inputs]

    • Restrict swizzle patterns

    v Does not support all possible compression patterns

    • Need fast computation of efficient swizzle settings


    (85)

    Distributing Data to Lanes (2)

    [Figure: 512-bit quad feeding a 4-lane ALU]

    • Each lane connected to 32 bits of the 128-bit bus

    • Each element of the quad is connected to 32 bits of a 128-bit segment

    (86)

    SCC Operation

    [Figure: RF for a single operand, read over cycles i through i+3; lane inputs are packed into a quad (128 bits, 4 lanes)]

    • Can compact across quads; swizzle settings overlapped with RF access

    • Note increased area, power/energy

    • Note the need for inverse permutations


    (87)

    Compaction Opportunities

    [Figure: SIMD-16 and SIMD-8 instructions with idle lanes; once the active threads are packed, no further compaction is possible]

    • For K active threads, what is the maximum cycle savings for SIMD-N instructions?
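One answer to the question above, assuming 4-lane execution and perfect compaction as in the preceding slides: a SIMD-N instruction takes N/4 cycles uncompressed, and ceil(K/4) cycles once the K active threads are packed.

```python
# Sketch: maximum cycle savings from compacting K active threads of a
# SIMD-N instruction on a 4-lane execution pipeline.
from math import ceil

def max_cycle_savings(K, N, lanes=4):
    baseline = N // lanes                       # cycles with no compaction
    compacted = ceil(K / lanes) if K > 0 else 0 # cycles after perfect packing
    return baseline - compacted

print(max_cycle_savings(5, 16))    # 4 - ceil(5/4) = 4 - 2 = 2
print(max_cycle_savings(16, 16))   # fully active: nothing to save -> 0
```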

    (88)

    Performance Savings

    • Difference between saving cycles and saving time

    v When is #cycles ≠ time?


    (89)

    Summary

    • Multi-cycle warp/SIMD/work_group execution

    • Optimize #cycles/warp by compressing idle cycles

    v Rearrange idle cycles via swizzling to create opportunity

    • Sensitivities to the memory interface speeds

    v Memory-bound applications may experience limited benefit

    (90)

    Intra-Warp Compaction


    • Scope limited to within a warp

    • Increasing scope means increasing warp size, explicitly or implicitly (treating multiple warps as a single warp)


    (91)

    Summary

    • Control flow divergence is a fundamental performance limiter for SIMT execution

    • Dynamic warp formation is one way to mitigate these effects

    v We will look at several others

    • Must balance a complex set of effects

    v Memory behaviors

    v Synchronization behaviors

    v Scheduler behaviors