CS 758: Advanced Topics in Computer Architecture
Lecture #8: GPU Warp Scheduling Research + DRAM Basics
Professor Matthew D. Sinclair
Some of these slides were developed by Tim Rogers at Purdue University, Tor Aamodt at the University of British Columbia, Wen-mei Hwu and David Kirk at the University of Illinois at Urbana-Champaign, Sudhakar Yalamanchili at Georgia Tech, and Onur Mutlu at Carnegie Mellon University.
Slides enhanced by Matt Sinclair
Studying Warp Scheduling on GPUs
• Numerous works on manipulating different schedulers
• Most look at the SM-side, issue-level warp scheduler
• Some look at TB scheduling at the TB-to-core level
• The fetch scheduler and operand collector scheduler are less studied
❖ Fetch largely follows issue
❖ Not clear what the opportunity in the operand collector is; even an opportunity study here would be helpful
Use Memory System Feedback [MICRO 2012]
3
[Figure: performance and cache misses (normalized) vs. the number of threads actively scheduled (0-40); performance peaks well before the maximum thread count. Diagram: a GPU core in which the thread scheduler receives feedback from the processor and cache]
Programmability case study [MICRO 2013]: Sparse Vector-Matrix Multiply
4
• Simple version: each thread has locality
• GPU-optimized version (SHOC Benchmark Suite, Oak Ridge National Labs) has added complication: dependent on warp size, parallel reduction, explicit scratchpad use, divergence
• Using DAWS scheduling, the simple version comes within 4% of the optimized version with no programmer input
Sources of Locality
• Intra-wavefront locality: Wave0 loads cache line X, then Wave0 loads X again → hit
• Inter-wavefront locality: Wave0 loads cache line X, then Wave1 loads X → hit
5
[Figure: hits/misses per thousand instructions (PKI), averaged over the highly cache-sensitive benchmarks, split into Misses PKI, Inter-Wavefront Hits PKI, and Intra-Wavefront Hits PKI; intra-wavefront hits dominate]
6
Scheduler affects access pattern
• Round-robin scheduler: Wave0 issues ld A,B,C,D, then Wave1 issues ld Z,Y,X,W, so the memory system sees the two working sets interleaved; by the time Wave0 repeats ld A,B,C,D, its lines have been displaced
• Greedy-then-oldest scheduler: Wave0 issues ld A,B,C,D and then reuses A,B,C,D before Wave1 issues ld Z,Y,X,W, so each wavefront's working set stays resident while it is being reused
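To make the contrast concrete, here is a minimal C++ sketch (not from the slides; names and structure are my own) of how a loose round-robin (LRR) scheduler and a greedy-then-oldest (GTO) scheduler pick the next warp to issue:

#include <cstdint>
#include <vector>

struct Warp {
    int id;
    uint64_t age;   // smaller = launched earlier (older)
    bool ready;     // has an issuable instruction this cycle
};

// LRR: rotate past the last-issued warp to the next ready one.
int pickLRR(const std::vector<Warp>& warps, int lastIssued) {
    int n = static_cast<int>(warps.size());
    for (int i = 1; i <= n; ++i) {
        int cand = (lastIssued + i) % n;
        if (warps[cand].ready) return cand;
    }
    return -1;  // nothing ready this cycle
}

// GTO: keep issuing from the current warp until it stalls,
// then fall back to the oldest ready warp.
int pickGTO(const std::vector<Warp>& warps, int current) {
    if (current >= 0 && warps[current].ready) return current;  // greedy
    int best = -1;
    for (int i = 0; i < static_cast<int>(warps.size()); ++i)
        if (warps[i].ready && (best < 0 || warps[i].age < warps[best].age))
            best = i;
    return best;  // oldest ready warp, or -1
}

Because GTO sticks with one wavefront, that wavefront's repeated loads (ld A,B,C,D then ld A,B,C,D again) can hit in the cache before another wavefront's working set displaces them.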
7
Use scheduler to shape access pattern
Cache-Conscious Wavefront Scheduling [MICRO 2012 best paper runner-up, Top Picks 2013, CACM Research Highlight]
[Figure: same two-wavefront example; the scheduler holds back Wave1's ld Z,Y,X,W so that Wave0's repeated ld A,B,C,D hits in the cache]
8
[Figure: CCWS hardware: the wave scheduler consults a locality scoring system fed by per-warp victim tags in the memory unit; cache tags carry the owning warp ID (WID). Example: W0 issues ld X; when X is evicted, tag X is recorded in W0's victim tags (W0,X); when W0 later misses on X, the probe of W0's victim tags hits, so W0 has detected lost locality and its score rises over time, causing the scheduler to stop issuing loads from other warps ("No W2 loads") while W2's ld Y waits]
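A minimal sketch of the lost-locality detection just described, assuming an idealized per-warp victim-tag store (real CCWS uses small set-associative victim tag arrays and scores that decay over time; all names here are invented):

#include <cstdint>
#include <unordered_set>
#include <vector>

struct LocalityScoring {
    std::vector<std::unordered_set<uint64_t>> victimTags;  // per warp
    std::vector<int> score;                                // per warp

    explicit LocalityScoring(int numWarps)
        : victimTags(numWarps), score(numWarps, 0) {}

    // Called when warp w misses on line `tag` and evicts `victimTag`.
    void onMiss(int w, uint64_t tag, uint64_t victimTag) {
        if (victimTags[w].count(tag)) {  // warp w itself evicted this line
            score[w] += 1;               // detected lost locality
            victimTags[w].erase(tag);
        }
        victimTags[w].insert(victimTag); // remember what was displaced
    }
};

A warp whose score rises effectively gets cache room back, because the scheduler stops issuing loads from the lowest-scored warps until the scores decay.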
9
[Figure: harmonic-mean speedup over the highly cache-sensitive benchmarks (scale 0-2) for LRR, GTO, and CCWS; CCWS achieves the highest speedup]
10
Static Wavefront Limiting [Rogers et al., MICRO 2012]
• By profiling an application we can find an optimal number of wavefronts to execute
• Does a little better than CCWS
• Limitations: requires profiling, is input dependent, and does not exploit phase behavior
11
Improve upon CCWS?
• CCWS detects bad scheduling decisions and avoids them in the future
• It would be better to "think ahead" / be proactive instead of reactive
12
(13)
Divergence-Aware Warp Scheduling
T. Rogers, M. O'Connor, and T. Aamodt, MICRO 2013
Goal
• Design a scheduler that matches the number of scheduled wavefronts to the L1 cache size
❖ Working set of the wavefronts fits in the cache
❖ Emphasis on intra-wavefront locality
• Differs from CCWS in being proactive
❖ Deeper look at what happens inside loops
❖ Explicitly handles divergence (both memory and control flow)
(14)
Key Idea
• Manage the relationship between control divergence, memory divergence and scheduling
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(15)
Key Idea (2)
[Figure: two warps executing "while (i < C[tid+1])". Warp 0 hits a divergent branch and exhibits intra-thread locality; its 4 references fill the cache, so warp 1 is delayed. Once there is room in the cache, warp 1 is scheduled. Warp 0's behavior is used to predict the interference warp 1 would cause]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(16)
Goal
Simpler portable version vs. GPU-optimized version: make the performance equivalent
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(17)
Observation
• The bulk of the accesses in a loop come from a few static load instructions
• The bulk of the locality in (these) applications is intra-loop
(18)
Distribution of Locality
The bulk of the locality comes from a few static loads in loops
Hint: can we keep data from the last iteration?
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(19)
A Solution
• Prediction mechanisms for locality across iterations of a loop
• Schedule such that data fetched in one iteration is still present at the next iteration
• Combine with control flow divergence (how much of the footprint needs to be in the cache?)
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(20)
Classification of Dynamic Loads
• Group static loads into equivalence classes: loads that reference the same cache line
• Identify these groups by a repetition ID
• Prediction for each load by compiler or hardware
• Loads are classified by memory divergence (converged, somewhat diverged, diverged), interacting with control flow divergence
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(21)
Predicting a Warp's Cache Footprint
• Entering the loop body (branch taken): create a footprint prediction
• Some threads exit the loop: predicted footprint drops
• Exiting the loop (branch not taken): reinitialize the prediction
• Predict locality usage of static loads
❖ Not all loads increase the footprint
• Combine with control divergence to predict the footprint
• Use the footprint to throttle/not throttle warp issue
(22)
Principles of Operation
• A prefix sum of each warp's cache footprint is used to select the warps that can be issued (see the sketch below)
EffCacheSize = kAssocFactor × TotalNumLines
• kAssocFactor scales back from a fully associative cache; empirically determined
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
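As referenced above, a minimal C++ sketch of the selection rule, assuming footprint[] holds each warp's predicted footprint in cache lines, ordered oldest-first (names are my own, not the paper's):

#include <cstddef>
#include <vector>

// A warp may issue only while the running (prefix) sum of predicted
// footprints still fits in the effective cache size.
std::vector<bool> selectIssuable(const std::vector<int>& footprint,
                                 int totalNumLines, double kAssocFactor) {
    double effCacheSize = kAssocFactor * totalNumLines;  // scaled back from
                                                         // fully associative
    std::vector<bool> canIssue(footprint.size(), false);
    double sum = 0.0;
    for (std::size_t w = 0; w < footprint.size(); ++w) {
        sum += footprint[w];             // prefix sum of cache footprints
        canIssue[w] = (sum <= effCacheSize);
    }
    return canIssue;
}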
(23)
Principles of Operation (2)
• Profile static load instructions
❖ Are they divergent?
❖ Loop repetition ID
o Assume all loads with the same base address, and offsets within the same cache line, are repeated each iteration
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(24)
Prediction Mechanisms
• Profiled Divergence-Aware Scheduling (DAWS)
❖ Uses offline profile results to dynamically determine de-scheduling decisions
• Detected Divergence-Aware Scheduling (DAWS)
❖ Behaviors derived at run time drive de-scheduling decisions
o Loops that exhibit intra-warp locality
o Static loads are characterized as divergent or convergent
(25)
Extensions for DAWS
[Figure: SM pipeline extended for DAWS: I-Fetch, Decode, I-Buffer, Issue, operand collector (OC), PRF, scalar pipelines, D-Cache ("all hit?"), and Writeback, with footprint-prediction adders attached at issue]
(26)
Operation: Tracking
• Per-warp footprint predictions form the basis for throttling
• Combine profile-based information with the number of active lanes
• One entry per warp issue slot
• Created/removed at loop begin/end
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(27)
Operation: Prediction
Sum the results from the static loads in this loop (a sketch follows below):
- Add #active-lanes cache lines for divergent loads
- Add 2 for converged loads
- Count loads in the same equivalence class only once (unless divergent)
• Generally only consider de-scheduling warps in loops
❖ Since most of the activity is there
• Can be extended to non-loop regions by associating non-loop code with the next loop
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
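A small C++ sketch of this footprint computation under the slide's rules (the struct fields and names are assumptions, not the paper's actual tables):

#include <set>
#include <vector>

struct StaticLoad {
    int repetitionId;  // loads with the same ID touch the same cache line
    bool divergent;    // memory-divergent load
};

int predictFootprint(const std::vector<StaticLoad>& loopLoads,
                     int activeLanes) {
    std::set<int> countedClasses;
    int footprint = 0;  // in cache lines
    for (const auto& ld : loopLoads) {
        if (ld.divergent) {
            footprint += activeLanes;  // one line per active lane
        } else if (countedClasses.insert(ld.repetitionId).second) {
            footprint += 2;            // converged load, class counted once
        }
    }
    return footprint;
}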
(28)
Operation: Nested Loops
for (i = 0; i < limitx; i++) {
    ....
    for (j = 0; j < limity; j++) { .... }
    ....
}
• On entry, update the prediction to that of the inner loop
• On re-entry, predict based on the inner loop's predictions
• On exit, do not clear the prediction
• De-scheduling of warps is determined by inner-loop behavior!
• Predictions are re-used from the inner-most loops, which is where most of the data re-use is found
(29)
Detected DAWS: Prediction
• A sampling warp (> 2 active threads) profiles each loop
• Detect both memory divergence and intra-loop repetition at run time
• Fill PC-load entries based on run-time information (a load enters the table only if it exhibits locality)
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(30)
Detected DAWS: Classification
• Increment or decrement a counter depending on the number of memory accesses a load generates
• Create equivalence classes of loads (by checking the PCs)
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(31)
Performance
[Figure: performance results showing little to no degradation across benchmarks]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(32)
Performance
[Figure: benchmarks with significant intra-warp locality; SPMV-scalar normalized to the best SPMV-vector]
Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
(33)
Summary
• If we can characterize warp-level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints
• A proactive scheme outperforms reactive management
• Understand the interactions between memory divergence and control divergence
(34)
OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
A. Jog et al., ASPLOS 2013
Goal
• Understand the memory effects of scheduling from deeper within the memory hierarchy
• Minimize idle cycles induced by warps stalling on memory references
Off-chip Bandwidth is Critical!
35
Percentage of total execution cycles wasted waiting for the data to come back from DRAM:
[Figure: per-application bars across the GPGPU application suite; Type-1 applications average 55%, and the overall average is 32%]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(36)
Source of Idle Cycles
• Warps stalled waiting on memory references
❖ Cache misses
❖ Service time at the memory controller
❖ Row buffer misses in DRAM
❖ Latency in the network (not addressed in this paper)
• The last-warp effect
• The last-CTA effect
• Lack of multiprogrammed execution
❖ One (small) kernel at a time
(37)
Impact of Idle Cycles
Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
High-Level View of a GPU
[Figure: threads grouped into warps, and warps into Cooperative Thread Arrays (CTAs), run on SIMT cores that each contain a scheduler, ALUs, and L1 caches; the cores share an interconnect, L2 cache, and DRAM]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
CTA-Assignment Policy (Example)
39
[Figure: a multi-threaded CUDA kernel with CTA-1 through CTA-4; CTA-1 and CTA-3 are assigned to SIMT Core-1, CTA-2 and CTA-4 to SIMT Core-2, each core with its own warp scheduler, ALUs, and L1 caches]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(40)
Organizing CTAs Into Groups
• Set the minimum number of warps per group equal to the number of pipeline stages
❖ Same philosophy as the two-level warp scheduler
• Use the same CTA grouping/numbering across SMs?
[Figure: SIMT Core-1 holds CTA-1 and CTA-3; SIMT Core-2 holds CTA-2 and CTA-4]
Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Warp Scheduling Policy
41
◼ All launched warps on a SIMT core have equal priority
❑ Round-robin execution
◼ Problem: many warps stall at long-latency operations at roughly the same time
[Figure: timeline in which all warps compute with equal priority, all send their memory requests at roughly the same time, and the SIMT core then stalls until data returns]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Solution
42
• Form warp groups (Narasiman et al., MICRO '11)
• CTA-aware grouping
• Group switch is round-robin
[Figure: timeline in which one warp group computes while another group's memory requests are outstanding, overlapping compute with memory latency and saving cycles relative to the all-stall case]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(43)
Two-Level Round-Robin Scheduler
[Figure: CTAs organized into groups (e.g., Group 0 = {CTA0, CTA1, CTA2, CTA3} through Group 3), with round-robin selection among groups and round-robin among warps within a group; a minimal sketch follows]
Agnostic to when pending misses are satisfied
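A rough C++ sketch of the two-level selection (group round-robin outer loop, warp round-robin inner loop; the state variables are my own simplification, not OWL's hardware):

#include <vector>

struct GroupedWarp { int group; bool ready; };

// Prefer the current group; within a group, rotate among ready warps.
int pickTwoLevelRR(const std::vector<GroupedWarp>& warps, int numGroups,
                   int& rrGroup, int& rrWarp) {
    int n = static_cast<int>(warps.size());
    for (int g = 0; g < numGroups; ++g) {
        int group = (rrGroup + g) % numGroups;
        for (int i = 1; i <= n; ++i) {
            int w = (rrWarp + i) % n;
            if (warps[w].group == group && warps[w].ready) {
                rrGroup = group;  // stay in this group until it stalls
                rrWarp = w;
                return w;
            }
        }
    }
    return -1;  // no ready warp in any group
}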
Objective 1: Improve Cache Hit Rates
44
[Figure: timeline comparison. Without switching, the core keeps working on CTA 1, 3, 5, 7 even when CTA 1's data arrives, and 4 CTAs touch the cache in time T. With a switch back to CTA 1 when its data arrives, only 3 CTAs touch the cache in time T]
Fewer CTAs accessing the cache concurrently → less cache contention
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Reduction in L1 Miss Rates
45
◼ Limited benefits for cache-insensitive applications
◼ What is happening deeper in the memory system?
[Figure: normalized L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and AVG under Round-Robin, CTA-Grouping, and CTA-Grouping-Prioritization; miss rates drop by roughly 8% with grouping and 18% with grouping plus prioritization]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(46)
The Off-Chip Memory Path
[Figure: compute units CU 0 - CU 15 and CU 16 - CU 31 connect through multiple memory controllers (MCs) to off-chip GDDR5 channels]
Access patterns? Ordering and buffering?
(47)
Inter-CTA Locality
[Figure: two SIMT cores (one running CTA-1 and CTA-3, the other CTA-2 and CTA-4, each with its own warp scheduler, ALUs, and L1 caches) issue requests to four DRAM channels]
How do CTAs interact at the MC and in DRAM?
(48)
Impact of the Memory Controller
• Memory scheduling policies
❖ Optimize BW vs. memory latency
• Impact of row buffer access locality
• Cache lines?
(49)
Row Buffer Locality
The DRAM Subsystem
(51)
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
51
(52)
Page Mode DRAM
• A DRAM bank is a 2D array of cells: rows × columns
• A "DRAM row" is also called a "DRAM page"
• "Sense amplifiers" are also called the "row buffer"
• Each address is a <row, column> pair
• Access to a "closed row"
❖ Activate command opens the row (placed into the row buffer)
❖ Read/write command reads/writes a column in the row buffer
❖ Precharge command closes the row and prepares the bank for the next access
• Access to an "open row"
❖ No need for an activate command
52
(53)
DRAM Bank Operation
53
[Figure: a bank's row decoder, cell array, row buffer, and column mux servicing the access sequence below]
• Access (Row 0, Column 0): row buffer empty → activate Row 0 into the row buffer, then read Column 0
• Access (Row 0, Column 1): Row 0 already open → HIT, read Column 1
• Access (Row 0, Column 85): Row 0 still open → HIT, read Column 85
• Access (Row 1, Column 0): Row 0 is open → CONFLICT! Precharge, activate Row 1, then read Column 0
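The hit/miss/conflict cases above fit in a few lines of code; this is an illustrative C++ sketch of the state machine, not a timing model:

#include <cstdio>

struct Bank {
    int openRow = -1;  // -1 means no row is latched in the row buffer

    const char* access(int row) {
        if (openRow == row) return "READ/WRITE (row-buffer hit)";
        if (openRow == -1) { openRow = row; return "ACTIVATE + READ/WRITE"; }
        openRow = row;     // the old row must be closed first
        return "PRECHARGE + ACTIVATE + READ/WRITE (row conflict)";
    }
};

int main() {
    Bank b;
    std::printf("%s\n", b.access(0));  // (Row 0, Col 0): opens row 0
    std::printf("%s\n", b.access(0));  // (Row 0, Col 1): hit
    std::printf("%s\n", b.access(0));  // (Row 0, Col 85): hit
    std::printf("%s\n", b.access(1));  // (Row 1, Col 0): conflict
}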
(54)
The DRAM Chip
• Consists of multiple banks (2-16 in Synchronous DRAM)
• Banks share command/address/data buses
• The chip itself has a narrow interface (4-16 bits per read)
54
(55)
128M x 8-bit DRAM Chip
55
(56)
DRAM Rank and Module
• Rank: Multiple chips operated together to form a wide interface
• All chips comprising a rank are controlled at the same time
❖ Respond to a single command
❖ Share address and command buses, but provide different data
❖ Like DRAM “SIMD”
• A DRAM module consists of one or more ranks
❖ E.g., DIMM (dual inline memory module)
❖ This is what you plug into your motherboard
• If we have chips with an 8-bit interface, reading 8 bytes in a single access requires 8 chips in a DIMM
56
(57)
A 64-bit Wide DIMM (One Rank)
57
[Figure: one rank of eight DRAM chips sharing the command bus; each chip supplies 8 bits of the 64-bit data bus]
(58)
Multiple DIMMs
58
• Advantages:
❖ Enables even higher capacity
• Disadvantages:
❖ Interconnect complexity and energy consumption can be high
(59)
DRAM Channels
• 2 independent channels: 2 memory controllers (above)
• 2 dependent/lockstep channels: 1 memory controller with a wide interface (not shown above)
59
(60)
Generalized Memory Structure
60
(61)
Generalized Memory Structure
61
The DRAM Subsystem
The Top Down View
(63)
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
63
(64)
The DRAM subsystem
[Figure: a processor with two memory channels, each channel connecting to a DIMM (dual in-line memory module)]
(65)
Breaking down a DIMM
[Figure: a DIMM (dual in-line memory module) in side view, showing the front and back of the module]
(66)
Breaking down a DIMM
[Figure: DIMM front and back; Rank 0 is the collection of 8 chips on the front, Rank 1 the 8 chips on the back]
(67)
Rank
[Figure: Rank 0 (front) and Rank 1 (back) share the memory channel's Data <0:63> and Addr/Cmd buses; chip-select CS <0:1> picks the responding rank]
(68)
Breaking down a Rank
[Figure: Rank 0 = Chip 0 ... Chip 7; Chip 0 drives Data <0:7>, Chip 1 drives <8:15>, ..., Chip 7 drives <56:63>]
(69)
Breaking down a Chip
[Figure: Chip 0 contains multiple banks (Bank 0, ...), all sharing the chip's 8-bit data interface <0:7>]
(70)
Breaking down a Bank
[Figure: Bank 0 is an array of 16K rows (row 0 ... row 16k-1), each row 2kB wide; an activated row sits in the row buffer, and reads deliver 1B (one column) at a time over the chip's <0:7> interface]
(71)
DRAM Subsystem Organization
• Channel
• DIMM
• Rank
• Chip
• Bank
• Row/Column
71
(72)
Example: Transferring a cache block
[Figure: a 64B cache block in the physical address space (addresses 0x00-0x40) maps to Channel 0, DIMM 0, Rank 0]
(73)
Example: Transferring a cache block
[Figure: the block is interleaved across Rank 0's Chip 0 - Chip 7, which together drive Data <0:63>]
(74)
Example: Transferring a cache block
[Figure: Row 0, Col 0 is addressed in every chip of the rank]
(75)
Example: Transferring a cache block
[Figure: the first 8B of the block transfer in one I/O cycle, 1B from each chip]
(76)
Example: Transferring a cache block
[Figure: Row 0, Col 1 is addressed next]
(77)
Example: Transferring a cache block
[Figure: the next 8B transfer, again 1B per chip]
(78)
Example: Transferring a cache block
[Figure: the pattern continues across the remaining columns]
A 64B cache block takes 8 I/O cycles to transfer.
During the process, 8 columns are read sequentially.
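The arithmetic behind the 8-cycle figure, as a tiny check (assuming the 8-chip, 1B-per-chip-per-cycle configuration of this example):

#include <cstdio>

int main() {
    const int blockBytes = 64, chips = 8, bytesPerChipPerCycle = 1;
    int cycles = blockBytes / (chips * bytesPerChipPerCycle);
    std::printf("I/O cycles per 64B block: %d\n", cycles);  // prints 8
}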
(79)
Latency Components: Basic DRAM Operation
• CPU → controller transfer time
• Controller latency
❖ Queuing & scheduling delay at the controller
❖ Access converted to basic commands
• Controller → DRAM transfer time
• DRAM bank latency
❖ Simple CAS if row is “open” OR
❖ RAS + CAS if array precharged OR
❖ PRE + RAS + CAS (worst case)
• DRAM → CPU transfer time (through controller)
79
(80)
Multiple Banks (Interleaving) and Channels
• Multiple banks
❖ Enable concurrent DRAM accesses
❖ Bits in address determine which bank an address resides in
• Multiple independent channels serve the same purpose
❖ But they are even better because they have separate data buses
❖ Increased bus bandwidth
• Enabling more concurrency requires reducing
❖ Bank conflicts
❖ Channel conflicts
• How to select/randomize bank/channel indices in address?
❖ Lower order bits have more entropy
❖ Randomizing hash functions (XOR of different address bits)
80
(81)
How Multiple Banks/Channels Help
81
(82)
Multiple Channels
• Advantages
❖ Increased bandwidth
❖ Multiple concurrent accesses (if independent channels)
• Disadvantages
❖ Higher cost than a single channel
o More board wires
o More pins (if on-chip memory controller)
82
(83)
Address Mapping (Single Channel)
• Single-channel system with an 8-byte memory bus
❖ 2GB memory, 8 banks, 16K rows & 2K columns per bank
• Row interleaving
❖ Consecutive rows of memory in consecutive banks
❖ Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
• Cache block interleaving
o Consecutive cache block addresses in consecutive banks
o 64-byte cache blocks
o Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
o Accesses to consecutive cache blocks can be serviced in parallel
o How about random accesses? Strided accesses?
83
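A sketch of decoding the row-interleaved layout above (field widths follow this slide's 2GB, 8-bank example; the sample address is arbitrary):

#include <cstdint>
#include <cstdio>

void decodeRowInterleaved(uint32_t addr) {
    uint32_t byteInBus =  addr        & 0x7;     // bits 0-2
    uint32_t column    = (addr >> 3)  & 0x7FF;   // bits 3-13
    uint32_t bank      = (addr >> 14) & 0x7;     // bits 14-16
    uint32_t row       = (addr >> 17) & 0x3FFF;  // bits 17-30
    std::printf("row=%u bank=%u col=%u byte=%u\n",
                row, bank, column, byteInBus);
}

int main() { decodeRowInterleaved(0x12345678); }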
(84)
Bank Mapping Randomization
• The DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely
❖ e.g., Bank index (3 bits) = XOR of the 3-bit bank field with 3 other address bits
84
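For example, a sketch consistent with the field layout on the previous slide (the choice of which row bits to fold in is an assumption):

#include <cstdint>

uint32_t randomizedBank(uint32_t addr) {
    uint32_t bankBits = (addr >> 14) & 0x7;  // original 3-bit bank field
    uint32_t rowBits  = (addr >> 17) & 0x7;  // 3 low-order row bits
    return bankBits ^ rowBits;               // XOR of different address bits
}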
(85)
Address Mapping (Multiple Channels)
• Where are consecutive cache blocks?
• A channel bit (C) can be inserted at any position in either layout:
❖ Row interleaving: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits), with C placed before, between, or after the fields
❖ Cache block interleaving: Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits), again with C at any position
• The lower the channel bit sits in the address, the finer consecutive cache blocks are interleaved across channels
85
(86)
Interaction with Virtual→Physical Mapping
• The operating system influences where an address maps to in DRAM
• The OS can control which bank/channel/rank a virtual page is mapped to
• It can perform page coloring to minimize bank conflicts, or to minimize inter-application interference
[Figure: VA = Virtual page number (52 bits) | Page offset (12 bits) → PA = Physical frame number (19 bits) | Page offset (12 bits), with the PA then interpreted as Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)]
86
(87)
DRAM Refresh (I)
• DRAM capacitor charge leaks over time
• The memory controller needs to read each row periodically to restore the charge
❖ Activate + precharge each row every N ms
❖ Typical N = 64 ms
• Implications for performance?
-- DRAM bank unavailable while being refreshed
-- Long pause times: if we refresh all rows in a burst, every 64 ms the DRAM will be unavailable until the refresh ends
• Burst refresh: all rows refreshed immediately after one another
• Distributed refresh: each row refreshed at a different time, at regular intervals (see the worked example below)
87
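As referenced above, a worked example of the distributed-refresh interval, assuming 8192 rows per bank (a common figure; the actual row count is device-specific):

#include <cstdio>

int main() {
    const double windowMs = 64.0;  // typical refresh window N
    const int rows = 8192;         // assumed rows per bank
    double intervalUs = windowMs * 1000.0 / rows;
    std::printf("refresh one row every %.2f us\n", intervalUs);  // ~7.81 us
}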
(88)
DRAM Refresh (II)
• Distributed refresh eliminates long pause times
88
(89)
Downsides of DRAM Refresh
-- Energy consumption: each refresh consumes energy
-- Performance degradation: DRAM rank/bank unavailable while being refreshed
-- QoS/predictability impact: (long) pause times during refresh
89
(90)
Back to the paper…
CTA Data Layout (A Simple Example)
91
[Figure: a data matrix A(0,0)..A(3,3) stored in row-major order; matrix rows 0-3 map to DRAM Banks 1-4 respectively, while CTA 1 through CTA 4 each work on one quadrant of the matrix]
Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Implications of High CTA-Row Sharing
92
[Figure: with the same CTA prioritization order on both cores, SIMT Core-1 (CTA-1, CTA-3) and SIMT Core-2 (CTA-2, CTA-4) direct their requests to the same rows in Bank-1 and Bank-2, leaving Bank-3 and Bank-4 idle: high row locality but low bank-level parallelism. Staggering the prioritization spreads the requests across banks: lower row locality but higher bank-level parallelism]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(94)
Some Additional Details
• Spread references from multiple CTAs (on multiple SMs) across row buffers in distinct banks
• Do not use the same CTA group prioritization across SMs
❖ Play the odds
• What happens with applications with unstructured, irregular memory access patterns?
Objective 2: Improving Bank-Level Parallelism
[Figure: with CTA prioritization orders staggered across SIMT Core-1 and SIMT Core-2, requests from CTA-1 through CTA-4 spread across Bank-1 through Bank-4]
11% increase in bank-level parallelism
14% decrease in row buffer locality
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Objective 3: Recovering Row Locality
[Figure: memory-side prefetching pulls the rest of an open DRAM row into the L2 cache, so later requests from the other SIMT core become L2 hits]
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
Memory-Side Prefetching
◼ Prefetch the so-far-unfetched cache lines in an already open row into the L2 cache, just before the row is closed
◼ What to prefetch?
❑ Sequentially prefetch the cache lines that were not accessed by demand requests
❑ More sophisticated schemes are left as future work
◼ When to prefetch?
❑ Opportunistic in nature
❑ Option 1: prefetching stops as soon as a demand request arrives for another row (demands are always critical; sketched below)
❑ Option 2: give more time for prefetching, making demands wait if there are not many (demands are NOT always critical)
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
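A hypothetical sketch of the Option 1 policy (sequential, demand-first prefetching of the open row; all names are invented, not OWL's implementation):

#include <vector>

struct OpenRow {
    int rowId;
    std::vector<bool> fetched;  // one flag per cache line in the row
};

// Returns the next line to prefetch into L2, or -1 to close the row.
int nextPrefetch(const OpenRow& row, bool demandForOtherRowPending) {
    if (demandForOtherRowPending) return -1;  // Option 1: demands win
    for (int line = 0; line < static_cast<int>(row.fetched.size()); ++line)
        if (!row.fetched[line]) return line;  // sequential, unfetched only
    return -1;  // entire row already fetched
}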
IPC Results (Normalized to Round-Robin)
[Figure: normalized IPC across the benchmark suite for Objective 1, Objectives (1+2), Objectives (1+2+3), and a Perfect-L2 configuration; average speedups of 25%, 31%, and 33% respectively, vs. 44% for Perfect-L2]
◼ Within 11% of Perfect L2
Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
(99)
Summary
• Coordinated scheduling across SMs, CTAs, and warps
• Consideration of effects deeper in the memory system
• Coordinating warp residence in the core with the presence of corresponding lines in the cache
(100)
CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads
S.-Y. Lee, A. Arunkumar, and C.-J. Wu, ISCA 2015
Goal
• Reduce warp divergence and hence increase throughput
• The key is the identification of critical (lagging) warps
• Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence
(101)
Review: Resource Limits on Occupancy
[Figure: a kernel distributor and SM scheduler dispatch thread blocks to SMs backed by DRAM. Within an SM, the warp contexts and register file limit the number of threads, while L1/shared memory and thread-block control state limit the number of thread blocks; warp schedulers feed the SPs. SM = stream multiprocessor, SP = stream processor. Locality effects follow from which blocks share an SM]
(102)
Evolution of Warps in TB
• Coupled lifetimes of warps in a TB
❖ Start at the same time
❖ Synchronization barriers
❖ Kernel exit (an implicit synchronization barrier)
• As warps complete, the TB enters a region where latency hiding is less effective
Figure from P. Xiang et al., "Warp Level Divergence: Characterization, Impact, and Mitigation"
(103)
Warp Criticality Problem
[Figure: GPU organization: host CPU, interconnection bus, kernel management unit with HW work queues and pending kernels, kernel distributor, SMX scheduler, SMXs (cores, registers, L1 cache / shared memory, warp schedulers, warp contexts), memory controller, L2 cache, and DRAM]
• Available registers: spatial underutilization
• Registers allocated to completed warps in the TB: temporal underutilization
• The last warp holds the TB's resources while the completed warps' contexts sit idle
→ Manage resources and schedules around critical warps
(104)
The Warp Criticality Problem
• Significant warp execution disparity for warps in the same thread block
[Figure (breadth-first search): execution time of warps w0-w15 in one thread block, sorted by criticality and normalized to the fastest warp, ranging up to about 1.5x]
[Figure: average execution-time disparity, split into computation and idle time, for bfs, b+tree, heartwall, kmeans, needle, srad_1, strcltr_small, and the average; idle time accounts for roughly 40%]
*Lee and Wu, "CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads," PACT 2014
(105)
Research Questions
• What is the source of warp criticality?
• How can we effectively accelerate critical warp execution?
5/28
(106)
Source of Warp Criticality
• Workload Imbalance
• Diverging Branch Behavior
• Memory Contention and Memory Access Latency
• Execution Order of Warp Scheduling
7/28
(107)
Workload Imbalance & Diverging Branch
8/28
[Figure: for warps w0-w15 sorted by criticality, execution time and dynamic instruction count (both normalized to the fastest warp) rise together]
• Workload imbalance or diverging branch behavior gives warps different dynamic instruction counts
(108)
Memory Contention
9/28
[Figure: execution time of warps w0-w15, sorted by criticality and normalized to the fastest warp (up to about 2.5x), split into total memory system latency and other latency]
• While warps experience different latencies to access memory, memory contention can induce warp criticality
(109)
Warp Scheduling Order
10/28
[Figure: execution time of warps w0-w15, sorted by criticality and normalized to the fastest warp (up to about 2.5x), split into scheduler latency and other latency]
• The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality
(110)
CAWA: Criticality-Aware Warp Acceleration
Coordinated warp scheduling and cache prioritization design:
• Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime
• greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating the critical warp's execution
• Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical-warp reuse
(111)
CAWA: Criticality-Aware Warp Acceleration
• Criticality Prediction Logic (CPL)– Predic6ng and iden6fying the cri6cal warp at run6me
• greedy Criticality Aware Warp Scheduler (gCAWS)
– Prioritizing and accelerating the critical warp execution
• Criticality-‐Aware Cache Prioritization (CACP)
– Prioritizing and allocating cache lines for critical warp reuse
(112)
CAWA CPL: Criticality Prediction Logic
14/28
• Estimate the number of additional cycles a warp may experience
• nInst is decremented whenever an instruction is executed

Criticality = nInst × w.CPIavg + nStall

• nInst × CPIavg captures instruction-count disparity and diverging branches; nStall captures memory latency and scheduling latency
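In code form, the estimate is a one-liner; the struct below is an assumed bookkeeping layout for illustration, not the paper's hardware tables:

struct WarpState {
    int    nInst;   // predicted remaining instructions (decremented on issue)
    double cpiAvg;  // running average CPI of this warp
    int    nStall;  // stall cycles observed so far
};

double criticality(const WarpState& w) {
    return w.nInst * w.cpiAvg + w.nStall;  // nInst * CPIavg + nStall
}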
(113)
CAWA: Criticality-Aware Warp Acceleration
• Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime
• greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating the critical warp's execution ← reduces delay from the scheduler
• Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical-warp reuse
(114)
CAWA gCAWS: greedy Criticality-Aware Warp Scheduler
16/28
• Prioritize warps based on their criticality given by CPL
• Execute warps in a greedy* manner
– Select the most critical ready warp
– Keep executing the selected warp until it stalls
• Example warp pool (criticality): Warp 0 = 5, Warp 1 = 10, Warp 2 = 3, Warp 3 = 7
❖ Traditional approach (e.g., RR, 2L, GTO): W0→W1→W2→W3
❖ gCAWS: W1→W3→W0→W2
*Rogers et al., "Cache-Conscious Wavefront Scheduling," MICRO '12
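A minimal C++ sketch of this selection rule (with the warp pool above it yields W1 first, then W3, W0, W2 as each stalls or finishes; names are my own):

#include <vector>

struct CawaWarp { bool ready; double criticality; };

int pickGCAWS(const std::vector<CawaWarp>& warps, int current) {
    if (current >= 0 && warps[current].ready) return current;  // greedy
    int best = -1;
    for (int w = 0; w < static_cast<int>(warps.size()); ++w)
        if (warps[w].ready &&
            (best < 0 || warps[w].criticality > warps[best].criticality))
            best = w;
    return best;  // most critical ready warp, or -1 if none
}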
(115)
CAWA: Criticality-Aware Warp Acceleration
• Criticality Prediction Logic (CPL): predicting and identifying the critical warp at runtime
• greedy Criticality-Aware Warp Scheduler (gCAWS): prioritizing and accelerating the critical warp's execution
• Criticality-Aware Cache Prioritization (CACP): prioritizing and allocating cache lines for critical-warp reuse ← reduces delay from the memory
(116)
CAWA CACP: Criticality-Aware Cache Prioritization
*Wu et al., "SHiP: Signature-based Hit Predictor for High Performance Caching," MICRO '11
19/28
(117)
20/28
CAWA CACP: Criticality-Aware Cache Prioritization
[Figure: the cache-request path through CACP, which uses criticality to decide which lines to keep for critical-warp reuse]
(118)
Overall Performance Improvement
23/28
[Figure: MPKI (0.9-1.2) and IPC (0.7-1.2, with outlier bars labeled 3.13x and 2.54x), both normalized to the baseline, for 2L, GTO, and CAWA]
CAWA aims to retain data for critical warps
Improving GPU performance via large warps and two-level warp scheduling, Narasiman et al., MICRO '11. Cache conscious wavefront scheduling, Rogers, O'Connor, and Aamodt, MICRO '12.
(119)
gCAWS Performance Improvement
24/28
[Figure: IPC normalized to the baseline RR for gCAWS and CAWA, roughly 0.90-1.30 across benchmarks, with outlier bars labeled 2.63x and 3.13x; a "large memory footprint" callout marks the outliers]
(120)
Performance Improvement with CAWA CACP
25/28
[Figure: MPKI normalized to the baseline RR (roughly 0.50-1.50) for RR+CACP, 2L+CACP, GTO+CACP, and CAWA]
• CACP can effectively reduce cache interference with any scheduler
• Schedulers need modification for robust and higher performance improvement with CACP
(121)
Summary
• Warp divergence leads to some lagging warps → critical warps
• Expose the performance impact of critical warps → throughput reduction
• Coordinate scheduler and cache management to reduce warp divergence