
CS 758: Advanced Topics in Computer Architecture

Lecture #8: GPU Warp Scheduling Research + DRAM Basics

Professor Matthew D. Sinclair

Some of these slides were developed by Tim Rogers at Purdue University, Tor Aamodt at the University of British Columbia, Wen-mei Hwu and David Kirk at the University of Illinois at Urbana-Champaign, Sudhakar Yalamanchili at Georgia Tech, and Onur Mutlu at Carnegie Mellon University.

Slides enhanced by Matt Sinclair

Studying Warp Scheduling on GPUs

• Numerous works on manipulating different schedulers

• Most look at the SM-side issue-level warp scheduler

• Some look at TB scheduling at the TB-to-core level

• The fetch scheduler and the operand collector's scheduling are less studied

❖ Fetch largely follows issue

❖ It is not clear what the opportunity in the operand collector is; even an opportunity study here would be helpful

Use Memory System Feedback [MICRO 2012]

3

[Figure: performance and cache misses plotted against the number of threads actively scheduled; a GPU core in which the thread scheduler receives feedback from the processor cache.]

Programmability case study [MICRO 2013]: Sparse Vector-Matrix Multiply

4

Simple version vs. GPU-optimized version (SHOC Benchmark Suite, Oak Ridge National Labs)

Added complication in the optimized version: dependent on warp size, parallel reduction, explicit scratchpad use, divergence

Simple version: each thread has locality

Using DAWS scheduling, the simple version comes within 4% of the optimized version with no programmer input

Sources of Locality

• Intra-wavefront locality: Wave0 loads cache line X and a later load from Wave0 hits on the same line in the data cache.

• Inter-wavefront locality: Wave0 loads cache line X and a later load from Wave1 hits on the same line in the data cache.

5

[Figure: average hits/misses per kilo-instruction (PKI) for the highly cache-sensitive workloads, broken down into Misses PKI, Inter-Wavefront Hits PKI, and Intra-Wavefront Hits PKI.]

6

Scheduler affects access pattern

[Figure: Wave0 repeatedly executes "ld A,B,C,D ..." and Wave1 repeatedly executes "ld Z,Y,X,W ...". A round-robin wavefront scheduler interleaves the two wavefronts, so the memory system sees A,B,C,D alternating with Z,Y,X,W. A greedy-then-oldest scheduler keeps issuing Wave0, so the memory system sees A,B,C,D again and again before Wave1's loads arrive.]

7

Use scheduler to shape access pattern

Cache-Conscious Wavefront Scheduling [MICRO 2012 best paper runner-up, Top Picks 2013, CACM Research Highlight]

[Figure: the same two wavefronts as before, under a greedy-then-oldest scheduler versus CCWS. The scheduler is used to shape the access pattern so that each wavefront's loads (A,B,C,D for Wave0, Z,Y,X,W for Wave1) reach the memory system grouped together rather than interleaved.]

8

[Figure: CCWS hardware. The memory unit's cache is augmented with per-wave victim tags (W0, W1, W2), and a locality scoring system tracks a score per wave over time. On "W0: ld X", the line (tag X, WID W0) is installed in the cache; when it is evicted, its tag moves to W0's victim tags. If a later "W0: ld X" misses in the cache but a probe of W0's victim tags hits, lost intra-wave locality is detected and W0's score rises, pushing lower-priority waves out of the scheduling window (no W2 loads are issued in the example).]

9
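To make the mechanism above concrete, here is a minimal sketch (not the paper's hardware) of CCWS-style lost-locality detection and load throttling, assuming simplified per-wave victim tag lists and a fixed score cutoff; all structure names, parameters, and the top-2 rule are invented for illustration.

#include <cstdint>
#include <deque>
#include <vector>

// One victim-tag list and one lost-locality score per wave (illustrative only).
struct WaveState {
    std::deque<uint64_t> victimTags;  // tags of lines this wave recently lost from the cache
    int score = 0;                    // lost-locality score
};

struct CCWSSketch {
    std::vector<WaveState> waves;
    int scoreCutoff;                  // total score the scheduler tolerates before throttling
    size_t victimCapacity = 16;

    CCWSSketch(int numWaves, int cutoff) : waves(numWaves), scoreCutoff(cutoff) {}

    // Called when a line owned by wave `wid` is evicted from the L1.
    void onEviction(int wid, uint64_t tag) {
        auto &v = waves[wid].victimTags;
        v.push_back(tag);
        if (v.size() > victimCapacity) v.pop_front();
    }

    // Called on an L1 miss by wave `wid`: probe that wave's victim tags.
    void onMiss(int wid, uint64_t tag) {
        for (uint64_t t : waves[wid].victimTags)
            if (t == tag) { waves[wid].score += 2; return; }  // lost locality detected
        if (waves[wid].score > 0) waves[wid].score -= 1;      // otherwise slowly decay
    }

    // Crude approximation of CCWS throttling: when the total lost-locality score
    // exceeds the cutoff, only the waves with the highest scores may issue loads.
    bool mayIssueLoad(int wid) const {
        int total = 0, higher = 0;
        for (const auto &w : waves) total += w.score;
        if (total <= scoreCutoff) return true;     // enough cache room for everyone
        for (const auto &w : waves)
            if (w.score > waves[wid].score) ++higher;
        return higher < 2;                         // keep only the top-scoring waves issuing
    }
};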

[Figure: harmonic-mean speedup over the highly cache-sensitive workloads for LRR, GTO, and CCWS (y-axis 0 to 2).]

10

Static Wavefront Limiting [Rogers et al., MICRO 2012]

• By profiling an application, we can find an optimal number of wavefronts to execute

• Does a little better than CCWS

• Limitations: requires profiling, is input dependent, and does not exploit phase behavior

11

Improve upon CCWS?

• CCWS detects bad scheduling decisions and avoids them in the future.

• It would be better if we could "think ahead" / be proactive instead of reactive.

12

(13)

Divergence-Aware Warp Scheduling
T. Rogers, M. O'Connor, and T. Aamodt, MICRO 2013

Goal

• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size

❖ Working set of the wavefronts fits in the cache

❖ Emphasis on intra-wavefront locality

• Differs from CCWS in being proactive

❖ Deeper look at what happens inside loops

❖ Proactive

❖ Explicitly handles divergence (both memory and control flow)

(14)

Key Idea

• Manage the relationship between control divergence, memory divergence and scheduling

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(15)

Key Idea (2)

[Figure: two warps execute the divergent loop "while (i < C[tid+1])". Warp 0's accesses exhibit intra-thread locality and fill the cache with 4 references, so warp 1 is delayed; once there is room in the cache, warp 1 is scheduled. Warp 0's behavior is used to predict the interference warp 1 would cause.]

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
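The loop in the figure comes from the scalar sparse matrix-vector multiply kernel in the case study; below is a minimal per-thread version of that loop sketched in plain C++. The row-pointer array C matches the slide; cols, vals, x, and y are assumed names, and the real SHOC kernel differs in detail.

// Scalar CSR SpMV: one "thread" handles one matrix row. The while loop over
// i < C[tid+1] is the divergent loop from the figure: rows with more nonzeros
// keep some threads of a warp iterating after the others have finished.
void spmv_row(int tid, const int *C, const int *cols,
              const float *vals, const float *x, float *y) {
    float sum = 0.0f;
    int i = C[tid];               // first nonzero of this row
    while (i < C[tid + 1]) {      // divergent branch: trip count varies per thread
        sum += vals[i] * x[cols[i]];
        ++i;                      // consecutive threads walk different rows,
    }                             //   so each thread has its own (intra-thread) locality
    y[tid] = sum;
}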

(16)

Goal

Simpler portable version vs. GPU-optimized version: make the performance equivalent

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(17)

Observation

• Bulk of the accesses in a loop come from a few static load instructions

• Bulk of the locality in (these) applications is intra-loop

(18)

Distribution of Locality

Bulk of the locality comes from a few static loads in loops

Hint: can we keep data from the last iteration?

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(19)

A Solution

• Prediction mechanisms for locality across iterations of a loop

• Schedule such that data fetched in one iteration is still present at the next iteration

• Combine with control flow divergence (how much of the footprint needs to be in the cache?)

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(20)

Classification of Dynamic Loads

• Group static loads into equivalence classes → loads that reference the same cache line

• Identify these groups by repetition ID

• Prediction for each load by compiler or hardware: converged, somewhat diverged, or diverged (control flow divergence)

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(21)

Predicting a Warp's Cache Footprint

• Entering the loop body: create a footprint prediction

• Exiting the loop: reinitialize the prediction

• Some threads exit the loop (taken vs. not-taken paths): the predicted footprint drops

• Predict locality usage of static loads

❖ Not all loads increase the footprint

• Combine with control divergence to predict the footprint

• Use the footprint to throttle / not throttle warp issue

(22)

Principles of Operation

• A prefix sum of each warp's cache footprint is used to select the warps that can be issued

EffCacheSize = kAssocFactor × TotalNumLines

• kAssocFactor scales back from a fully associative cache; empirically determined

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
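A minimal sketch of how the prefix-sum check above could work, assuming each warp already has a predicted footprint in cache lines; the in-order cutoff, names, and types are illustrative rather than the paper's hardware.

#include <vector>

// Pick which warps may issue: walk warps in priority order, accumulate their
// predicted footprints (a running prefix sum), and stop admitting warps once
// the sum would exceed the effective cache size.
std::vector<bool> selectIssuableWarps(const std::vector<int> &footprintLines,
                                      int totalNumLines, double kAssocFactor) {
    const double effCacheSize = kAssocFactor * totalNumLines;  // EffCacheSize from the slide
    std::vector<bool> mayIssue(footprintLines.size(), false);
    double prefixSum = 0.0;
    for (size_t w = 0; w < footprintLines.size(); ++w) {
        if (prefixSum + footprintLines[w] <= effCacheSize) {
            prefixSum += footprintLines[w];
            mayIssue[w] = true;   // this warp's working set still fits in the cache
        }
        // warps whose footprint does not fit are throttled (not issued) this cycle
    }
    return mayIssue;
}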

(23)

Principles of Operation (2)

• Profile static load instructions

❖ Are they divergent?

❖ Loop repetition ID

o Assume all loads with the same base address, and whose offsets fall within the same cache line, are repeated each iteration

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(24)

Prediction Mechanisms

• Profiled Divergence-Aware Scheduling (DAWS)

❖ Uses offline profile results to dynamically make de-scheduling decisions

• Detected Divergence-Aware Scheduling (DAWS)

❖ Behaviors derived at run time drive de-scheduling decisions

o Loops that exhibit intra-warp locality

o Static loads are characterized as divergent or convergent

(25)

Extensions for DAWS

[Figure: SM pipeline with DAWS extensions: I-Fetch, I-Buffer, Decode, Issue, Operand Collector (OC), Register File (PRF), scalar pipelines, D-Cache (all hit?), Writeback.]

(26)

Operation: Tracking

• Basis for throttling

• Profile-based information, one entry per warp issue slot

• Entries created/removed at loop begin/end

• Tracks the number of active lanes

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(27)

Operation: Prediction

• Sum results from the static loads in this loop

- Add #active-lanes cache lines for divergent loads

- Add 2 for converged loads

- Count loads in the same equivalence class only once (unless divergent)

• Generally only consider de-scheduling warps in loops, since most of the activity is there

• Can be extended to non-loop regions by associating non-loop code with the next loop

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013
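A minimal sketch of the footprint calculation described above, assuming each static load carries a divergence flag and an equivalence-class (repetition) ID; the struct and function names are invented for illustration.

#include <set>
#include <vector>

struct StaticLoad {
    int repetitionId;  // equivalence class: loads predicted to touch the same line
    bool divergent;    // memory-divergence prediction for this load
};

// Predicted cache footprint (in lines) of one warp for one loop iteration,
// following the rules on the slide.
int predictFootprint(const std::vector<StaticLoad> &loopLoads, int activeLanes) {
    int footprint = 0;
    std::set<int> countedClasses;                 // classes already charged once
    for (const StaticLoad &ld : loopLoads) {
        if (ld.divergent) {
            footprint += activeLanes;             // each active lane may touch its own line
        } else if (countedClasses.insert(ld.repetitionId).second) {
            footprint += 2;                       // converged load: charge 2 lines, once per class
        }
    }
    return footprint;
}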

(28)

Operation: Nested Loops

for (i = 0; i < limitx; i++) {
  ...
  for (j = 0; j < limity; j++) { ... }
  ...
}

• On entry to the inner loop, update the prediction to that of the inner loop

• On re-entry, predict based on the inner loop's predictions

• On exit, do not clear the prediction

• De-scheduling of warps is determined by inner-loop behavior!

• Predictions are re-used based on the inner-most loops, which is where most of the data re-use is found

(29)

Detected DAWS: Prediction

• A sampling warp for the loop (one with >2 active threads) characterizes its loads

• Detect both memory divergence and intra-loop repetition at run time

• Fill PCLoad entries based on run-time information (an entry is created for a load when locality is detected)

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(30)

Detected DAWS: Classification

• Increment or decrement a counter depending on the number of memory accesses a load generates

• Create equivalence classes of loads (by checking their PCs)

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(31)

Performance

Little to no degradation

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(32)

Performance

Significant intra-warp locality

SPMV-scalar normalized to the best SPMV-vector

Figure from T. Rogers, M. O'Connor, T. Aamodt, "Divergence-Aware Warp Scheduling," MICRO 2013

(33)

Summary

• If we can characterize warp-level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints

• A proactive scheme outperforms reactive management

• Understand the interactions between memory divergence and control divergence

(34)

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance
A. Jog et al., ASPLOS 2013

Goal

• Understand the memory effects of scheduling from deeper within the memory hierarchy

• Minimize idle cycles induced by stalling warps waiting on memory references

Off-chip Bandwidth is Critical!

35

[Figure: percentage of total execution cycles wasted waiting for data to come back from DRAM across a wide range of GPGPU applications, grouped into Type-1 and Type-2 applications. Type-1 applications waste around 55% of cycles on average, and the average across all applications is around 32%.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(36)

Source of Idle Cycles

• Warps stalled waiting on memory references

❖ Cache miss

❖ Service at the memory controller

❖ Row buffer miss in DRAM

❖ Latency in the network (not addressed in this paper)

• The last warp effect

• The last CTA effect

• Lack of multiprogrammed execution

❖ One (small) kernel at a time

(37)

Impact of Idle Cycles

Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

High-Level View of a GPU

[Figure: high-level view of a GPU. SIMT cores, each with a warp scheduler, ALUs, and L1 caches, execute warps of threads grouped into Cooperative Thread Arrays (CTAs); the cores connect through an interconnect to a shared L2 cache and DRAM.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

CTA-Assignment Policy (Example)

39

[Figure: a multi-threaded CUDA kernel is split into CTA-1 through CTA-4, which are assigned two per SIMT core (each core with its warp scheduler, ALUs, and L1 caches).]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(40)

Organizing CTAs Into Groups

• Set minimum number of warps equal to #pipeline stages

❖ Same philosophy as the two-level warp scheduler

• Use same CTA grouping/numbering across SMs?

[Figure: two SIMT cores, each with a warp scheduler, ALUs, and L1 caches, holding CTA-1 through CTA-4 grouped across the cores.]

Figure from A. Jog et al., "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Warp Scheduling Policy

◼ All launched warps on a SIMT core have equal priority

❑ Round-robin execution

◼ Problem: many warps stall at long-latency operations at roughly the same time

41

[Figure: timeline for a SIMT core under round-robin scheduling. All warps (grouped into CTAs) compute with equal priority, then all send memory requests at roughly the same time, so the core stalls with no warps left to hide the latency.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Solution

42

• Form warp-groups (Narasiman, MICRO '11)

• CTA-aware grouping

• Group switch is round-robin

[Figure: the same timeline, but warps are scheduled group by group. One group computes while another group's memory requests are outstanding, so the stall period shrinks and cycles are saved compared to plain round-robin.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(43)

Two Level Round Robin Scheduler

[Figure: CTAs are organized into groups (Group 0 through Group 3); scheduling is round-robin across groups and round-robin among the warps within a group. The policy is agnostic to when pending misses are satisfied.]
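A minimal sketch of a two-level round-robin pick, assuming a fixed assignment of warps to groups; this illustrates the policy described above (switching groups when the current group has no ready warp), not the OWL or Narasiman hardware.

#include <vector>

// Two-level round-robin: rotate over groups at the outer level and over the
// warps of the current group at the inner level, skipping warps that are not ready.
struct TwoLevelRR {
    std::vector<std::vector<int>> groups;  // groups[g] = warp IDs in group g
    size_t currentGroup = 0;
    std::vector<size_t> nextInGroup;       // per-group inner round-robin pointer

    explicit TwoLevelRR(std::vector<std::vector<int>> g)
        : groups(std::move(g)), nextInGroup(groups.size(), 0) {}

    // Returns the warp ID to issue this cycle, or -1 if no warp is ready.
    int pick(const std::vector<bool> &ready) {
        for (size_t gTried = 0; gTried < groups.size(); ++gTried) {
            const auto &grp = groups[currentGroup];
            for (size_t wTried = 0; wTried < grp.size(); ++wTried) {
                size_t idx = (nextInGroup[currentGroup] + wTried) % grp.size();
                int wid = grp[idx];
                if (ready[wid]) {
                    nextInGroup[currentGroup] = (idx + 1) % grp.size();
                    return wid;
                }
            }
            // whole group stalled: switch groups round-robin
            currentGroup = (currentGroup + 1) % groups.size();
        }
        return -1;
    }
};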

Objective 1: Improve Cache Hit Rates

44

[Figure: two timelines over an interval T. Without switching, CTA 1, CTA 3, CTA 5, and CTA 7 all access the cache during T (4 CTAs). With switching back to CTA 1 as soon as its data arrives, only 3 CTAs access the cache during T. Fewer CTAs accessing the cache concurrently → less cache contention.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Reduction in L1 Miss Rates

45

◼ Limited benefits for cache-insensitive applications

◼ What is happening deeper in the memory system?

[Figure: L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and the average, normalized to round-robin, for CTA-grouping and CTA-grouping-prioritization; miss rates drop by roughly 8% and 18% on average, respectively.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(46)

The Off-Chip Memory Path

[Figure: compute units CU 0 through CU 31 connect through multiple memory controllers (MCs) to off-chip GDDR5 channels.]

• Access patterns?

• Ordering and buffering?

(47)

Inter-CTA Locality

[Figure: two SIMT cores (warp scheduler, ALUs, L1 caches) running CTA-1 through CTA-4, connected to multiple DRAM channels.]

How do CTAs interact at the memory controller and in DRAM?

(48)

Impact of the Memory Controller

• Memory scheduling policies

❖ Optimize bandwidth vs. memory latency

• Impact of row buffer access locality

• Cache lines?

(49)

Row Buffer Locality

The DRAM Subsystem

(51)

DRAM Subsystem Organization

• Channel

• DIMM

• Rank

• Chip

• Bank

• Row/Column

51

(52)

Page Mode DRAM

• A DRAM bank is a 2D array of cells: rows x columns

• A “DRAM row” is also called a “DRAM page”

• “Sense amplifiers” also called “row buffer”

• Each address is a <row,column> pair

• Access to a "closed row"

❖ Activate command opens the row (places it into the row buffer)

❖ Read/write command reads/writes a column in the row buffer

❖ Precharge command closes the row and prepares the bank for the next access

• Access to an "open row"

❖ No need for an activate command

52
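To make the open-row/closed-row cases concrete, here is a small sketch of the command sequence a controller would issue for one request to a single bank, assuming we know which row (if any) is currently open; the enum and function names are illustrative only.

#include <optional>
#include <vector>

enum class DramCmd { ACTIVATE, READ, WRITE, PRECHARGE };

// Commands needed to service a column access to `row`, given which row (if any)
// is currently open in the bank's row buffer.
std::vector<DramCmd> commandsFor(std::optional<int> openRow, int row, bool isWrite) {
    std::vector<DramCmd> cmds;
    if (openRow && *openRow == row) {
        // Row buffer hit: column access only.
    } else if (!openRow) {
        // Bank is precharged (closed row): activate the row first.
        cmds.push_back(DramCmd::ACTIVATE);
    } else {
        // Row buffer conflict: close the open row, then activate the new one.
        cmds.push_back(DramCmd::PRECHARGE);
        cmds.push_back(DramCmd::ACTIVATE);
    }
    cmds.push_back(isWrite ? DramCmd::WRITE : DramCmd::READ);
    return cmds;
}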

(53)

DRAM Bank Operation

53

[Figure: a DRAM bank is a 2D array of rows and columns with a row decoder, row buffer, and column mux. Access sequence: (Row 0, Column 0) activates Row 0 into the initially empty row buffer; (Row 0, Column 1) and (Row 0, Column 85) hit in the row buffer; (Row 1, Column 0) conflicts, requiring the bank to close Row 0 and activate Row 1.]

(54)

The DRAM Chip

• Consists of multiple banks (2-16 in Synchronous DRAM)

• Banks share command/address/data buses

• The chip itself has a narrow interface (4-16 bits per read)

54

(55)

128M x 8-bit DRAM Chip

55

(56)

DRAM Rank and Module

• Rank: Multiple chips operated together to form a wide interface

• All chips comprising a rank are controlled at the same time

❖ Respond to a single command

❖ Share address and command buses, but provide different data

❖ Like DRAM “SIMD”

• A DRAM module consists of one or more ranks

❖ E.g., DIMM (dual inline memory module)

❖ This is what you plug into your motherboard

• If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

56

(57)

A 64-bit Wide DIMM (One Rank)

57

[Figure: one rank of eight DRAM chips sharing the command/address bus; together their data pins form the 64-bit data bus.]

(58)

Multiple DIMMs

58

• Advantages:

❖ Enables even higher capacity

• Disadvantages:

❖ Interconnect complexity and energy consumption can be high

(59)

DRAM Channels

• 2 independent channels: 2 memory controllers (above)

• 2 dependent/lockstep channels: 1 memory controller with a wide interface (not shown above)

59

(60)

Generalized Memory Structure

60

(61)

Generalized Memory Structure

61

The DRAM Subsystem

The Top Down View

(63)

DRAM Subsystem Organization

• Channel

• DIMM

• Rank

• Chip

• Bank

• Row/Column

63

(64)

The DRAM subsystem

[Figure: a processor connects over memory channels to DIMMs (dual in-line memory modules); each channel has its own DIMM(s).]

(65)

Breaking down a DIMM

[Figure: side view of a DIMM (dual in-line memory module), showing the front and back of the module.]

(66)

Breaking down a DIMM

[Figure: the same DIMM; the collection of 8 chips on the front is Rank 0, and the chips on the back form Rank 1.]

(67)

Rank

[Figure: Rank 0 (front) and Rank 1 (back) share the 64-bit data bus (Data <0:63>) and the address/command bus on the memory channel, and are selected by chip-select signals CS <0:1>.]

(68)

Breaking down a Rank

[Figure: Rank 0 consists of Chip 0 through Chip 7. Chip 0 provides data bits <0:7>, Chip 1 provides <8:15>, ..., Chip 7 provides <56:63>, together forming Data <0:63>.]

(69)

Breaking down a Chip

[Figure: Chip 0 has an 8-bit interface (<0:7>) shared by its multiple banks (Bank 0, ...); each access moves 8 bits through this interface.]

(70)

Breaking down a Bank

[Figure: Bank 0 is an array of rows (row 0 through row 16k-1), each 2 kB wide. The row buffer holds one 2 kB row; a column access reads 1 B from the row buffer through the chip's 8-bit interface.]

(71)

DRAM Subsystem Organization

• Channel

• DIMM

• Rank

• Chip

• Bank

• Row/Column

71

(72)

Example: Transferring a cache block

[Figure sequence (slides 72-78): a 64 B cache block in the physical memory space (e.g., at address 0x40) maps to Channel 0, DIMM 0, Rank 0. Reading Row 0, Col 0 delivers 8 B, 1 B from each of Chip 0 through Chip 7 over Data <0:63>; reading Row 0, Col 1 delivers the next 8 B, and so on.]

A 64 B cache block takes 8 I/O cycles to transfer.

During the process, 8 columns are read sequentially.

(79)

Latency Components: Basic DRAM Operation

• CPU → controller transfer time

• Controller latency

❖ Queuing & scheduling delay at the controller

❖ Access converted to basic commands

• Controller → DRAM transfer time

• DRAM bank latency

❖ Simple CAS if row is “open” OR

❖ RAS + CAS if array precharged OR

❖ PRE + RAS + CAS (worst case)

• DRAM → CPU transfer time (through controller)

79

(80)

Multiple Banks (Interleaving) and Channels

• Multiple banks

❖ Enable concurrent DRAM accesses

❖ Bits in address determine which bank an address resides in

• Multiple independent channels serve the same purpose

❖ But they are even better because they have separate data buses

❖ Increased bus bandwidth

• Enabling more concurrency requires reducing

❖ Bank conflicts

❖ Channel conflicts

• How to select/randomize bank/channel indices in address?

❖ Lower order bits have more entropy

❖ Randomizing hash functions (XOR of different address bits)

80

(81)

How Multiple Banks/Channels Help

81

(82)

Multiple Channels

• Advantages

❖ Increased bandwidth

❖ Multiple concurrent accesses (if independent channels)

• Disadvantages

❖ Higher cost than a single channel

o More board wires

o More pins (if on-chip memory controller)

82

(83)

Address Mapping (Single Channel)

• Single-channel system with an 8-byte memory bus

❖ 2 GB memory, 8 banks, 16K rows & 2K columns per bank

• Row interleaving

❖ Consecutive rows of memory in consecutive banks

Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

• Cache block interleaving

o Consecutive cache block addresses in consecutive banks

o 64-byte cache blocks

Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)

o Accesses to consecutive cache blocks can be serviced in parallel

o How about random accesses? Strided accesses?

83
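A small sketch of decoding a physical address under the two mappings above (2 GB, 8 banks, 16K rows, 2K columns, 8-byte bus); the field order follows the layouts on this slide, and the struct/function names are just for illustration.

#include <cstdint>

struct DramAddr { uint32_t row, bank, column, byteInBus; };

// Row interleaving: Row (14) | Bank (3) | Column (11) | Byte in bus (3)
DramAddr decodeRowInterleaved(uint32_t addr) {
    DramAddr a;
    a.byteInBus = addr & 0x7;
    a.column    = (addr >> 3)  & 0x7FF;   // 11 bits
    a.bank      = (addr >> 14) & 0x7;     // 3 bits
    a.row       = (addr >> 17) & 0x3FFF;  // 14 bits
    return a;
}

// Cache block interleaving: Row (14) | High Col (8) | Bank (3) | Low Col (3) | Byte (3)
DramAddr decodeBlockInterleaved(uint32_t addr) {
    DramAddr a;
    a.byteInBus      = addr & 0x7;
    uint32_t lowCol  = (addr >> 3)  & 0x7;    // 3 bits
    a.bank           = (addr >> 6)  & 0x7;    // 3 bits: consecutive 64 B blocks hit consecutive banks
    uint32_t highCol = (addr >> 9)  & 0xFF;   // 8 bits
    a.row            = (addr >> 17) & 0x3FFF; // 14 bits
    a.column         = (highCol << 3) | lowCol;
    return a;
}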

(84)

Bank Mapping Randomization

• The DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely

84

The 3 bank-index bits are produced by XORing the original bank bits with 3 other address bits (e.g., from the row).
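A minimal sketch of such an XOR-based bank hash, reusing the row-interleaved layout from the previous sketch; which row bits to fold in is a design choice, so the selection here is just an example.

#include <cstdint>

// XOR-randomized bank index: fold 3 low-order row bits into the 3 bank bits so
// that strided accesses sharing the same plain bank index spread across banks.
uint32_t randomizedBank(uint32_t addr) {
    uint32_t bank   = (addr >> 14) & 0x7;  // original bank bits (row-interleaved layout)
    uint32_t rowLow = (addr >> 17) & 0x7;  // 3 low-order row bits (example choice)
    return bank ^ rowLow;
}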

(85)

Address Mapping (Multiple Channels)

• Where are consecutive cache blocks?

85

[Figure: the same row-interleaved and cache-block-interleaved layouts as before, with a channel bit (C) inserted at different positions in the address; the position of C determines how consecutive cache blocks spread across channels.]

(86)

Interaction with Virtual→Physical Mapping

• Operating System influences where an address maps to in DRAM

• Operating system can control which bank/channel/rank a virtual page is mapped to.

• It can perform page coloring to minimize bank conflicts

• Or to minimize inter-application interference

86

[Figure: a virtual address (virtual page number, 52 bits, plus page offset, 12 bits) is translated to a physical address (physical frame number, 19 bits, plus page offset); the physical address then maps onto the Row | Bank | Column | Byte-in-bus fields, so the OS's choice of frame influences the bank/channel.]

(87)

DRAM Refresh (I)

• DRAM capacitor charge leaks over time

• The memory controller needs to read each row periodically to restore the charge

❖ Activate + precharge each row every N ms

❖ Typical N = 64 ms

• Implications on performance?

-- DRAM bank unavailable while refreshed

-- Long pause times: If we refresh all rows in burst, every 64ms the DRAM will be unavailable until refresh ends

• Burst refresh: All rows refreshed immediately after one another

• Distributed refresh: Each row refreshed at a different time, at regular intervals

87
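A quick back-of-the-envelope sketch of the distributed-refresh rate implied by the numbers above (64 ms refresh window; the 8192-row count is an assumed example, not from the slide).

#include <cstdio>

int main() {
    const double refreshWindowMs = 64.0;  // every row must be refreshed within 64 ms
    const int rowsPerBank = 8192;         // assumed example row count
    // Distributed refresh: spread the row refreshes evenly over the window.
    double intervalUs = refreshWindowMs * 1000.0 / rowsPerBank;
    std::printf("one row refresh every %.2f us\n", intervalUs);  // ~7.81 us
    return 0;
}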

(88)

DRAM Refresh (II)

• Distributed refresh eliminates long pause times

88

(89)

Downsides of DRAM Refresh

-- Energy consumption: each refresh consumes energy

-- Performance degradation: DRAM rank/bank unavailable while refreshed

-- QoS/predictability impact: (long) pause times during refresh

89

(90)

Back to the paper…

CTA Data Layout (A Simple Example)

91

[Figure: a data matrix A(0,0)..A(3,3) is laid out in DRAM in row-major order, so the portions accessed by CTA 1 through CTA 4 map onto Banks 1 through 4.]

Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Implications of high CTA-row sharing

92

[Figure: SIMT Core-1 runs warps from CTA-1 and CTA-3 while SIMT Core-2 runs warps from CTA-2 and CTA-4, with the same CTA prioritization order on both cores. Their requests, flowing through the L2 cache, concentrate on Row-1 in Bank-1 and Row-2 in Bank-2 while Bank-3 and Bank-4 sit idle: high row locality but low bank-level parallelism. Spreading the prioritization so the cores work on different CTAs sends requests to different banks: lower row locality but higher bank-level parallelism.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(94)

Some Additional Details

• Spread references from multiple CTAs (on multiple SMs) across the row buffers in the distinct banks

• Do not use the same CTA group prioritization across SMs

❖ Play the odds

• What happens with applications with unstructured, irregular memory access patterns?

Objective 2: Improving Bank Level Parallelism

[Figure: with different CTA prioritization orders on SIMT Core-1 and SIMT Core-2, requests from CTA-1 through CTA-4 spread across Bank-1 through Bank-4 (Rows 1-4) behind the L2 cache: an 11% increase in bank-level parallelism at the cost of a 14% decrease in row buffer locality.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Objective 3: Recovering Row Locality

[Figure: memory-side prefetching into the L2 cache recovers the row locality lost by spreading CTAs across banks; later accesses from SIMT Core-1 and Core-2 to Rows 1-4 then hit in the L2.]

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

Memory Side Prefetching

◼ Prefetch the so-far-unfetched cache lines of an already open row into the L2 cache, just before the row is closed

◼ What to prefetch?

❑ Sequentially prefetch the cache lines that were not accessed by demand requests

❑ More sophisticated schemes are left as future work

◼ When to prefetch?

❑ Opportunistic in nature

❑ Option 1: prefetching stops as soon as a demand request arrives for another row (demands are always critical)

❑ Option 2: give more time to prefetching and make demands wait if there are not many (demands are NOT always critical)

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013
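A minimal sketch of the Option-1 policy described above: keep prefetching untouched lines of the open row until a demand request for a different row shows up. The queue/row abstractions and names are invented for illustration.

#include <queue>
#include <vector>

struct Demand { int row; int column; };

// Before closing `openRow`, issue prefetches for its unaccessed columns (Option 1):
// stop as soon as a pending demand request targets a different row.
std::vector<int> memorySidePrefetch(int openRow,
                                    const std::vector<bool> &columnAccessed,
                                    const std::queue<Demand> &demandQueue) {
    std::vector<int> prefetchedCols;
    for (int col = 0; col < (int)columnAccessed.size(); ++col) {
        if (!demandQueue.empty() && demandQueue.front().row != openRow)
            break;                          // a demand for another row: demands are critical
        if (!columnAccessed[col])
            prefetchedCols.push_back(col);  // sequentially prefetch untouched lines into L2
    }
    return prefetchedCols;
}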

IPC results (Normalized to Round-Robin)

[Figure: IPC of the Type-1 applications normalized to round-robin for Objective 1, Objectives (1+2), Objectives (1+2+3), and a Perfect-L2 configuration; the average improvements are roughly 25%, 31%, and 33%, versus 44% for Perfect-L2.]

◼ Within 11% of Perfect L2

Courtesy A. Jog, "OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance," ASPLOS 2013

(99)

Summary

• Coordinated scheduling across SMs, CTAs, and warps

• Consideration of effects deeper in the memory system

• Coordinating warp residence in the core with the presence of the corresponding lines in the cache

(100)

CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads
S.-Y. Lee, A. A. Kumar, and C.-J. Wu, ISCA 2015

Goal

• Reduce warp divergence and hence increase throughput

• The key is the identification of critical (lagging) warps

• Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence

(101)

Review: Resource Limits on Occupancy

[Figure: the kernel distributor hands thread blocks (TBs) to SMs via the SM scheduler. Within an SM, the number of warp contexts and the register file limit the number of threads, while thread block control state and L1/shared memory limit the number of thread blocks; locality effects also matter. SM = stream multiprocessor, SP = stream processor.]

(102)

Evolution of Warps in TB

• Coupled lifetimes of warps in a TB

❖ Start at the same time

❖ Synchronization barriers

❖ Kernel exit (implicit synchronization barrier)

[Figure: warps in a thread block start together but complete at different times; after the first warps complete, latency hiding is less effective in the remaining region.]

Figure from P. Xiang et al., "Warp Level Divergence: Characterization, Impact, and Mitigation"

(103)

Warp Criticality Problem

[Figure: a GPU with host CPU, interconnection bus, SMXs, kernel distributor, kernel management unit, HW work queues, pending kernels, memory controller, L2 cache, and DRAM. Within an SMX: cores, registers, L1 cache / shared memory, warp schedulers, and warp contexts. Annotations: available registers represent spatial underutilization; registers allocated to completed warps in a TB represent temporal underutilization; completed (idle) warp contexts sit waiting for the last warp. The idea is to manage resources and schedules around critical warps.]

(104)

The Warp Criticality Problem

• Significant warp execution disparity for warps in the same thread block

[Figure: execution time of warps w0-w15 in a thread block for breadth-first search, sorted by criticality and normalized to the fastest warp; the slowest warps take up to roughly 1.4-1.5x as long as the fastest.]

[Figure: execution time disparity (computation vs. idle time) for bfs, b+tree, heartwall, kmeans, needle, srad_1, strcltr_small, and the average; the disparity reaches up to 100% and is around 40% on average.]

*Lee and Wu, "CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads," PACT 2014

(105)

Research Questions

• What is the source of warp criticality?

• How can we effectively accelerate critical warp execution?

5/28

(106)

Source of Warp Criticality

• Workload Imbalance

• Diverging Branch Behavior

• Memory Contention and Memory Access Latency

• Execution Order of Warp Scheduling

7/28

(107)

Workload Imbalance & Diverging Branch

8/28

• Workload imbalance or diverging branch behavior makes warps execute different numbers of dynamic instructions.

[Figure: for warps w0-w15 sorted by criticality, dynamic instruction count and execution time, both normalized to the fastest warp, track each other closely.]

(108)

Memory Contention

9/28

w0

w1

w2

w3

w4

w5

w6

w7

w8

w9

w1

0

w1

1

w1

2

w1

3

w1

4

w1

5Exe

cu6

on

Tim

e N

orm

aliz

ed

to

th

e F

aste

st W

arp

Warps (Sorted by Cri6cality)

• While warps experience different latency to access memory, memory contention can induce warp criticality.

Total Memory System Latency Other Latency2.5

2

1.5

1

0.5

0

(109)

Warp Scheduling Order

10/28

w0

w1

w2

w3

w4

w5

w6

w7

w8

w9

w1

0

w1

1

w1

2

w1

3

w1

4

w1

5Exe

cu6

on

Tim

e N

orm

aliz

ed

to

th

e F

aste

st W

arp

Warps (Sorted by Cri6cality)

• The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality

Scheduler Latency Other Latency2.5

2

1.5

1

0.5

0

(110)

CAWA: Criticality-Aware Warp Acceleration

Coordinated warp scheduling and cache prioritization design

• Criticality Prediction Logic (CPL) – predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS) – prioritizing and accelerating the critical warp's execution

• Criticality-Aware Cache Prioritization (CACP) – prioritizing and allocating cache lines for critical-warp reuse

(111)

CAWA: Criticality-Aware Warp Acceleration

• Criticality Prediction Logic (CPL) – predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS) – prioritizing and accelerating the critical warp's execution

• Criticality-Aware Cache Prioritization (CACP) – prioritizing and allocating cache lines for critical-warp reuse

(112)

CAWA CPL: Criticality Prediction Logic

14/28

• Evaluate the number of additional cycles a warp may experience

• nInst is decremented whenever an instruction is executed

Criticality = nInst × w.CPIavg + nStall

• The nInst × w.CPIavg term captures instruction count disparity and diverging branches; the nStall term captures memory latency and scheduling latency
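A minimal sketch of the criticality estimate above and a greedy pick over ready warps in the spirit of gCAWS; the per-warp bookkeeping fields and update points are assumptions for illustration.

#include <vector>

struct WarpInfo {
    int   nInst;   // predicted remaining dynamic instructions (decremented on execute)
    int   nStall;  // accumulated stall cycles (memory + scheduler)
    float cpiAvg;  // this warp's average CPI so far
    bool  ready;
};

// Criticality = nInst * CPIavg + nStall (from the slide).
float criticality(const WarpInfo &w) {
    return w.nInst * w.cpiAvg + w.nStall;
}

// gCAWS-style pick: choose the most critical ready warp (the surrounding scheduler
// loop would keep issuing from it until it stalls).
int pickMostCriticalReady(const std::vector<WarpInfo> &warps) {
    int best = -1;
    float bestCrit = -1.0f;
    for (int w = 0; w < (int)warps.size(); ++w) {
        if (warps[w].ready && criticality(warps[w]) > bestCrit) {
            bestCrit = criticality(warps[w]);
            best = w;
        }
    }
    return best;  // -1 if no warp is ready
}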

(113)

CAWA: Criticality-Aware Warp Acceleration

• Criticality Prediction Logic (CPL) – predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS) – prioritizing and accelerating the critical warp's execution (reduce delay from the scheduler)

• Criticality-Aware Cache Prioritization (CACP) – prioritizing and allocating cache lines for critical-warp reuse

(114)

CAWA gCAWS: greedy Criticality-Aware Warp Scheduler

16/28

• Prioritize warps based on their criticality given by CPL

• Execute warps in a greedy* manner

– Select the most critical ready warp

– Keep executing the selected warp until it stalls

Warp pool criticality example: Warp 0 = 5, Warp 1 = 10, Warp 2 = 3, Warp 3 = 7

Warp scheduler selection sequence:

• Traditional approach (e.g., RR, 2L, GTO): W0 → W1 → W2 → W3

• gCAWS: W1 → W3 → W0 → W2

*Rogers et al., "Cache-Conscious Wavefront Scheduling," MICRO 2012

(115)

CAWA: Criticality-Aware Warp Acceleration

• Criticality Prediction Logic (CPL) – predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS) – prioritizing and accelerating the critical warp's execution

• Criticality-Aware Cache Prioritization (CACP) – prioritizing and allocating cache lines for critical-warp reuse (reduce delay from the memory)

(116)

CAWA CACP: Criticality-Aware Cache Prioritization

19/28

*Wu et al., "SHiP: Signature-based Hit Predictor for High Performance Caching," MICRO 2011

(117)

20/28

CAWA CACP: Criticality-Aware Cache Prioritization

[Figure: cache request handling flow under CACP.]

(118)

Overall Performance Improvement

23/28

[Figure: IPC and MPKI normalized to the baseline for the 2L, GTO, and CAWA schedulers, with outliers of 2.54x and 3.13x marked on the IPC chart.]

CAWA aims to retain data for critical warps

Improving GPU performance via large warps and two-level warp scheduling. Narasiman et al., MICRO 2011. Cache-conscious wavefront scheduling. Rogers, O'Connor, and Aamodt, MICRO 2012.

(119)

gCAWS Performance Improvement

24/28

[Figure: IPC normalized to the baseline round-robin scheduler for gCAWS and CAWA, with outliers of 2.63x and 3.13x; one group of workloads is annotated as having a large memory footprint.]

(120)

Performance Improvement with CAWA CACP

25/28

[Figure: MPKI normalized to the baseline RR scheduler for RR+CACP, 2L+CACP, GTO+CACP, and CAWA.]

• CACP can effectively reduce cache interference with any scheduler

• Schedulers need modification for robust and higher performance improvement with CACP

(121)

Summary

• Warp divergence leads to some lagging warps → critical warps

• Expose the performance impact of critical warps → throughput reduction

• Coordinate scheduler and cache management to reduce warp divergence