CS 758: Advanced Topics in Computer Architecture
Lecture #8: GPU Warp Scheduling Research + DRAM Basics
Professor Matthew D. Sinclair
Some of these slides were developed by Tim Rogers at Purdue University, Tor Aamodt at the University of British Columbia, Wen-mei Hwu & David Kirk at the University of Illinois at Urbana-Champaign, Sudhakar Yalamanchili at Georgia Tech, and Prof. Onur Mutlu at Carnegie Mellon University. Slides enhanced by Matt Sinclair.

Transcript of CS 758: Advanced Topics in Computer Architecture, Lecture #8: GPU Warp Scheduling Research + DRAM Basics

Page 1:

CS 758: Advanced Topics in Computer Architecture

Lecture #8: GPU Warp Scheduling Research + DRAM Basics

Professor Matthew D. Sinclair

Some of these slides were developed by Tim Rogers at Purdue University, Tor Aamodt at the University of British Columbia, Wen-mei Hwu & David Kirk at the University of Illinois at Urbana-Champaign, Sudhakar Yalamanchili at Georgia Tech, and Prof. Onur Mutlu at Carnegie Mellon University.

Slides enhanced by Matt Sinclair

Page 2:

Studying Warp Scheduling on GPUs

• Numerous works on manipulating different schedulers

• Most looking at the SM-side issue-level warp scheduler

• Some look at TB scheduling at the TB-core level

• Fetch scheduler and operand collector scheduling are less studied
  • Fetch largely follows issue
• Not clear what the opportunity in the operand collector is
  • Even an opportunity study here would be helpful

Page 3:

Use Memory System Feedback [MICRO 2012]

[Figure: Performance and cache misses as a function of the number of threads actively scheduled; a GPU core's thread scheduler takes feedback from the processor cache.]

Page 4:

Programmability case study [MICRO 2013]: Sparse Vector-Matrix Multiply

Simple Version
• Each thread has locality
• Divergence

GPU-Optimized Version (SHOC Benchmark Suite, Oak Ridge National Labs)
• Added complication
• Dependent on warp size
• Parallel reduction
• Explicit scratchpad use

Using DAWS scheduling, the simple version comes within 4% of the optimized version with no programmer input.

Page 5:

Sources of Locality

• Intra-wavefront locality: Wave0 loads cache line X, then Wave0 loads X again → hit in the data cache.
• Inter-wavefront locality: Wave0 loads cache line X, then Wave1 loads X → hit in the data cache.
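The two locality cases above can be counted with a small ownership-tracking sketch (not from the slides; it assumes an idealized cache that never evicts):

```python
# Classify hits as intra- vs inter-wavefront: a hit is intra-wavefront
# when the same wave last touched the line, inter-wavefront otherwise.

def classify_hits(accesses):
    """accesses: list of (wave_id, cache_line) in program order."""
    owner = {}                      # cache_line -> wave that last touched it
    stats = {"intra": 0, "inter": 0, "miss": 0}
    for wave, line in accesses:
        if line not in owner:
            stats["miss"] += 1      # cold miss: line enters the cache
        elif owner[line] == wave:
            stats["intra"] += 1     # Wave0 ... Wave0 -> intra-wavefront hit
        else:
            stats["inter"] += 1     # Wave0 ... Wave1 -> inter-wavefront hit
        owner[line] = wave
    return stats

# Wave0 loads X twice (intra), then Wave1 loads X (inter).
print(classify_hits([(0, "X"), (0, "X"), (1, "X")]))
# -> {'intra': 1, 'inter': 1, 'miss': 1}
```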

Page 6:

[Figure: (Hits/Misses) per kilo-instruction (PKI), averaged over the highly cache-sensitive benchmarks, split into Misses PKI, Inter-Wavefront Hits PKI, and Intra-Wavefront Hits PKI.]

Page 7:

Scheduler affects access pattern

Round Robin Scheduler: Wave0 issues ld A,B,C,D… and Wave1 issues ld Z,Y,X,W…; their requests (D C B A, W X Y Z) interleave in the memory system.

Greedy then Oldest Scheduler: Wave0's ld A,B,C,D… requests (D C B A) reach the memory system together before Wave1 issues ld Z,Y,X,W….
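A minimal sketch (all names illustrative) of how the two policies order memory requests — round-robin interleaves warps while greedy-then-oldest drains one warp first:

```python
# Toy issue model: each warp has a queue of loads; the policies differ
# only in which non-empty warp issues next.
from collections import deque

def issue_order(warps, policy):
    """warps: {warp_id: [loads...]}; policy: 'lrr' or 'gto'."""
    queues = {w: deque(loads) for w, loads in warps.items()}
    order, ring = [], deque(sorted(queues))
    while any(queues.values()):
        if policy == "lrr":
            ring.rotate(-1)          # move to the next warp each slot
            w = ring[0]
        else:                        # gto: stick with the oldest ready warp
            w = min(wid for wid, q in queues.items() if q)
        if queues[w]:
            order.append((w, queues[w].popleft()))
    return order

warps = {0: ["A", "B"], 1: ["Z", "Y"]}
print(issue_order(warps, "lrr"))   # requests from warps 0 and 1 interleave
print(issue_order(warps, "gto"))   # warp 0 drains before warp 1 starts
```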

Page 8:

Use scheduler to shape access pattern

Cache-Conscious Wavefront Scheduling [MICRO 2012 best paper runner-up, Top Picks 2013, CACM Research Highlight]

Greedy then Oldest Scheduler: Wave0 (ld A,B,C,D…) and Wave1 (ld Z,Y,X,W…) both stream requests into the memory system. CCWS instead shapes the access pattern so that Wave0's lines (D C B A) stay resident across its repeated loads before Wave1 issues.

Page 9:

[Figure: CCWS microarchitecture. The memory unit's cache (Tag, WID, Data) is backed by per-wave victim tag arrays feeding a locality scoring system with a score per wave (W0, W1, W2). On W0: ld X, the victim tags are probed with (W0, X); a hit means W0 has lost locality, so W0's score rises and the lowest-priority wave (W2) is prevented from issuing loads.]
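A simplified sketch of that feedback loop — per-wave victim tags plus a locality score. The scoring and cutoff details here are assumptions for illustration, not the paper's exact mechanism:

```python
# On an L1 miss, a victim-tag hit means this wave recently lost the line
# to interference, so its locality score rises; a cumulative score cutoff
# then throttles the lowest-priority waves.

def probe(victim_tags, wave, line, score, gain=2):
    """Probe `wave`'s victim tags for `line` after an L1 miss."""
    if line in victim_tags[wave]:
        score[wave] += gain
        return True        # lost locality detected
    return False

def issuable(score, cutoff):
    """Stack waves by descending score; waves whose cumulative total has
    already reached the cutoff are prevented from issuing loads."""
    allowed, total = [], 0
    for w in sorted(score, key=score.get, reverse=True):
        if total < cutoff:
            allowed.append(w)
        total += max(score[w], 1)   # every wave occupies at least one slot
    return allowed

victim_tags = {0: {"X"}, 1: set(), 2: set()}
score = {0: 0, 1: 0, 2: 0}
probe(victim_tags, 0, "X", score)   # W0 re-loads a line it lost -> score up
print(issuable(score, cutoff=3))    # lowest-priority W2 gets throttled
```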

Page 10:

[Figure: Harmonic-mean speedup on the highly cache-sensitive benchmarks for the LRR, GTO, and CCWS schedulers.]

Page 11:

Static Wavefront Limiting[Rogers et al., MICRO 2012]

• Profiling an application, we can find an optimal number of wavefronts to execute

• Does a little better than CCWS.

• Limitations: Requires profiling, input dependent, does not exploit phase behavior.


Page 12:

Improve upon CCWS?

• CCWS detects bad scheduling decisions and avoids them in future.

• Would be better if we could “think ahead” / “be proactive” instead of “being reactive”


Page 13:

Divergence-Aware Warp Scheduling, T. Rogers, M. O'Connor, and T. Aamodt, MICRO 2013

Goal

• Design a scheduler to match the number of scheduled wavefronts to the L1 cache size

❖ Working set of the wavefronts fits in the cache

❖ Emphasis on intra-wavefront locality

• Differs from CCWS in being proactive

❖ Deeper look at what happens inside loops

❖ Proactive

❖ Explicitly Handles Divergence (both memory and control flow)

Page 14:

Key Idea

• Manage the relationship between control divergence, memory divergence and scheduling

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 15:

Key Idea (2)

Example: while (i < C[tid+1]) — a divergent branch; each thread has intra-thread locality.

• Warp 0: fill the cache with its 4 references; delay warp 1
• Use warp 0's behavior to predict the interference due to warp 1
• When room is available in the cache, schedule warp 1

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 16:

Goal

Make the performance of the simpler portable version equivalent to the GPU-optimized version.

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 17:

Observation

• The bulk of the accesses in a loop come from a few static load instructions
• The bulk of the locality in (these) applications is intra-loop

Page 18:

Distribution of Locality

Bulk of the locality comes from a few static loads in loops

Hint: Can we keep data from last iteration?

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 19:

A Solution

• Prediction mechanisms for locality across iterations of a loop
• Schedule such that data fetched in one iteration is still present at the next iteration
• Combine with control flow divergence (how much of the footprint needs to be in the cache?)

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 20:

Classification of Dynamic Loads

• Group static loads into equivalence classes → they reference the same cache line
• Identify these groups by repetition ID
• Prediction for each load by compiler or hardware

[Figure: loads classified by control flow divergence as converged, somewhat diverged, or diverged.]

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 21:

Predicting a Warp's Cache Footprint

• Entering loop body → create footprint prediction
• Some threads exit the loop (taken/not-taken paths diverge) → predicted footprint drops
• Exit loop → reinitialize prediction
• Predict locality usage of static loads
  ❖ Not all loads increase the footprint
• Combine with control divergence to predict footprint
• Use footprint to throttle/not-throttle warp issue

Page 22:

Principles of Operation

• Prefix sum of each warp's cache footprint is used to select warps that can be issued

EffCacheSize = kAssocFactor × TotalNumLines

• kAssocFactor scales back from a fully associative cache; determined empirically

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
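The issue rule above can be sketched as follows (variable and function names are assumptions; the numbers are illustrative):

```python
# EffCacheSize = kAssocFactor * TotalNumLines, then a prefix sum of
# per-warp footprint predictions selects the warps allowed to issue.

def effective_cache_size(total_lines, k_assoc_factor):
    # scale back from a fully associative cache (factor found empirically)
    return k_assoc_factor * total_lines

def select_warps(footprints, eff_cache_lines):
    """footprints: {warp_id: predicted #cache lines}. Walk warps oldest
    first; once the prefix sum no longer fits, de-schedule the rest."""
    allowed, used = [], 0
    for w in sorted(footprints):
        if used + footprints[w] > eff_cache_lines:
            break                      # this and all younger warps wait
        allowed.append(w)
        used += footprints[w]
    return allowed

eff = effective_cache_size(total_lines=32, k_assoc_factor=0.5)   # 16 lines
print(select_warps({0: 8, 1: 6, 2: 4, 3: 2}, eff))               # -> [0, 1]
```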

Page 23:

Principles of Operation (2)

• Profile static load instructions
  ❖ Are they divergent?
  ❖ Loop repetition ID
    o Assume all loads with the same base address and offset within the cache line are repeated each iteration

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 24:

Prediction Mechanisms

• Profiled Divergence-Aware Scheduling (DAWS)
  ❖ Uses offline profile results to dynamically determine de-scheduling decisions
• Detected Divergence-Aware Scheduling (DAWS)
  ❖ Behaviors derived at run time drive de-scheduling decisions
    o Loops that exhibit intra-warp locality
    o Static loads are characterized as divergent or convergent

Page 25:

Extensions for DAWS

[Figure: SIMT core pipeline extended for DAWS: I-Fetch → I-Buffer → Decode → Issue → operand collector (OC) → PRF → D-Cache (All Hit?) → Writeback, alongside the scalar pipelines.]

Page 26:

Operation: Tracking

[Figure: A per-warp tracking table — one entry per warp issue slot, created/removed at loop begin/end — holds profile-based information and the #active lanes; this is the basis for throttling.]

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 27:

Operation: Prediction

Sum the results from the static loads in this loop:
- Add #active-lanes cache lines for divergent loads
- Add 2 for converged loads
- Count loads in the same equivalence class only once (unless divergent)

• Generally only consider de-scheduling warps in loops, since most of the activity is there
• Can be extended to non-loop regions by associating non-loop code with the next loop

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013
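The summation rule above can be sketched directly (helper names are assumptions; the "+2 for converged" constant is taken from the slide):

```python
# Per-loop footprint prediction: divergent loads contribute one cache
# line per active lane; converged loads contribute 2 lines, counted once
# per equivalence (repetition) class.

def predict_footprint(loads, active_lanes):
    """loads: list of (equiv_class, 'diverged'|'converged') static loads."""
    footprint, seen = 0, set()
    for eq_class, kind in loads:
        if kind == "diverged":
            footprint += active_lanes      # worst case: one line per lane
        elif eq_class not in seen:
            footprint += 2                 # converged: counted once per class
            seen.add(eq_class)
    return footprint

# Two converged loads sharing a class + one divergent load, 8 active lanes:
print(predict_footprint([("c0", "converged"), ("c0", "converged"),
                         ("d0", "diverged")], active_lanes=8))   # -> 10
```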

Page 28:

Operation: Nested Loops

for (i = 0; i < limitx; i++) {
    ...
    for (j = 0; j < limity; j++) {
        ...
    }
    ...
}

• On entry, update the prediction to that of the inner loop
• On re-entry, predict based on the inner loop's predictions
• On exit, do not clear the prediction

De-scheduling of warps is determined by inner-loop behavior: predictions are re-used from the inner-most loops, which is where most of the data re-use is found.

Page 29:

Detected DAWS: Prediction

• Detect both memory divergence and intra-loop repetition at run time
• A sampling warp (>2 active threads) profiles each loop
• Fill PCLoad entries based on run-time information (a load enters the table if it exhibits locality)

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 30:

Detected DAWS: Classification

• Increment or decrement a counter depending on the number of memory accesses a load generates
• Create equivalence classes of loads (by checking their PCs)

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 31:

Performance

Little to no degradation

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 32:

Performance

[Figure: SPMV-scalar performance normalized to the best SPMV-vector; significant intra-warp locality.]

Figure from T. Rogers, M. O'Connor, T. Aamodt, “Divergence-Aware Warp Scheduling,” MICRO 2013

Page 33:

Summary

• If we can characterize warp-level memory reference locality, we can use this information to minimize interference in the cache through scheduling constraints
• A proactive scheme outperforms reactive management
• Understand the interactions between memory divergence and control divergence

Page 34:

OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance, A. Jog et al., ASPLOS 2013

Goal

• Understand the memory effects of scheduling from deeper within the memory hierarchy
• Minimize idle cycles induced by stalling warps waiting on memory references

Page 35:

Off-chip Bandwidth is Critical!

Percentage of total execution cycles wasted waiting for data to come back from DRAM:

[Figure: per-application breakdown across the GPGPU applications. The Type-1 applications average 55%; the average across all applications is 32%.]

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 36:

Source of Idle Cycles

• Warps stall waiting for memory references

❖ Cache miss

❖ Service at the memory controller

❖ Row buffer miss in DRAM

❖ Latency in the network (not addressed in this paper)

• The last warp effect

• The last CTA effect

• Lack of multiprogrammed execution

❖ One (small) kernel at a time

Page 37:

Impact of Idle Cycles

Figure from A. Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 38:

High-Level View of a GPU

[Figure: High-level view of a GPU: threads grouped into warps (W) within Cooperative Thread Arrays (CTAs); SIMT cores (each with a scheduler, ALUs, and L1 caches) connect through an interconnect to the L2 cache and DRAM.]

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 39:

CTA-Assignment Policy (Example)

[Figure: A multi-threaded CUDA kernel launches CTA-1 … CTA-4; CTA-1 and CTA-3 are assigned to SIMT Core-1, and CTA-2 and CTA-4 to SIMT Core-2. Each core has a warp scheduler, ALUs, and L1 caches.]

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 40:

Organizing CTAs Into Groups

• Set minimum number of warps equal to #pipeline stages

❖ Same philosophy as the two-level warp scheduler

• Use same CTA grouping/numbering across SMs?

[Figure: CTA-1/CTA-3 on SIMT Core-1 and CTA-2/CTA-4 on SIMT Core-2, as in the CTA-assignment example.]

Figure from A. Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 41:

Warp Scheduling Policy

◼ All launched warps on a SIMT core have equal priority
  ❑ Round-Robin execution
◼ Problem: Many warps stall at long-latency operations at roughly the same time

[Figure: Timeline — all warps compute with equal priority, all send memory requests at roughly the same time, and the SIMT core then stalls until data returns.]

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 42:

Solution

[Figure: With warp-groups, one group computes while another group's memory requests are outstanding; requests are spread over time, shrinking the SIMT core's stall period (Saved Cycles) relative to all warps stalling together.]

• Form warp-groups (Narasiman, MICRO 2011)
• CTA-aware grouping
• Group switch is round-robin

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 43:

Two Level Round Robin Scheduler

[Figure: CTAs (CTA0–CTA15) organized into four groups (Group 0–Group 3); round-robin (RR) across groups and among the warps within a group.]

Agnostic to when pending misses are satisfied.
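The two-level idea can be sketched as follows (the switch policy here — move to the next group after a full pass over the active group — is an assumption for illustration; real schedulers switch on stalls):

```python
# Two-level round-robin: RR across CTA groups, RR among the warps inside
# the active group.

def two_level_rr(groups, n_issues):
    """groups: one warp list per CTA group. Returns the warp picked in
    each issue slot."""
    g, w_idx = 0, [0] * len(groups)
    picks = []
    for _ in range(n_issues):
        warps = groups[g]
        picks.append(warps[w_idx[g] % len(warps)])
        w_idx[g] += 1
        if w_idx[g] % len(warps) == 0:   # finished a pass -> next group
            g = (g + 1) % len(groups)
    return picks

print(two_level_rr([["w0", "w1"], ["w2", "w3"]], 6))
# -> ['w0', 'w1', 'w2', 'w3', 'w0', 'w1']
```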

Page 44:

Objective 1: Improve Cache Hit Rates

[Figure: Timelines of CTAs 1, 3, 5, and 7 sharing a core. When the data for CTA1 arrives: with no switching, 4 CTAs are active over time T; with switching back to CTA1, only 3 CTAs are active in time T.]

Fewer CTAs accessing the cache concurrently → less cache contention

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 45:

Reduction in L1 Miss Rates


◼ Limited benefits for cache insensitive applications

◼ What is happening deeper in the memory system?

[Figure: Normalized L1 miss rates for SAD, SSC, BFS, KMN, IIX, SPMV, BFSR, and AVG under Round-Robin, CTA-Grouping, and CTA-Grouping-Prioritization; annotated average reductions of 8% and 18%.]

Courtesy A. Jog, “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 46:

The Off-Chip Memory Path

[Figure: Compute units CU 0–CU 15 and CU 16–CU 31 connect through six memory controllers (MC) to off-chip GDDR5 channels.]

Access patterns? Ordering and buffering?

Page 47:

Inter-CTA Locality

[Figure: Two SIMT cores (each with a warp scheduler, ALUs, and L1 caches) running CTA-1/CTA-3 and CTA-2/CTA-4, sharing four DRAM channels.]

How do CTAs interact at the MC and in DRAM?

Page 48:

Impact of the Memory Controller

• Memory scheduling policies
  ❖ Optimize BW vs. memory latency
• Impact of row buffer access locality
• Cache lines?

Page 49:

Row Buffer Locality

Page 50:

The DRAM Subsystem

Page 51:

DRAM Subsystem Organization

• Channel

• DIMM

• Rank

• Chip

• Bank

• Row/Column


Page 52:

Page Mode DRAM

• A DRAM bank is a 2D array of cells: rows × columns
• A “DRAM row” is also called a “DRAM page”
• “Sense amplifiers” are also called the “row buffer”
• Each address is a <row, column> pair
• Access to a “closed row”
  ❖ Activate command opens the row (placed into the row buffer)
  ❖ Read/write command reads/writes a column in the row buffer
  ❖ Precharge command closes the row and prepares the bank for the next access
• Access to an “open row”
  ❖ No need for an activate command
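The three access cases can be sketched with an illustrative cost model (the latencies below are assumptions, not values from the slides):

```python
# Page-mode costs: an open-row hit needs only a read/write; a closed row
# needs an activate first; a different open row needs precharge + activate.

ACT, RW, PRE = 15, 15, 15        # assumed command latencies, in cycles

def access(open_row, row):
    """Return (cycles, row now open) for one column access to `row`."""
    if open_row == row:
        return RW, row           # row-buffer hit
    if open_row is None:
        return ACT + RW, row     # closed row: activate, then access
    return PRE + ACT + RW, row   # conflict: close the old row first

open_row, total = None, 0
for r in [0, 0, 0, 1]:           # three accesses to Row 0, then Row 1
    cycles, open_row = access(open_row, r)
    total += cycles
print(total)                     # 30 + 15 + 15 + 45 = 105 cycles
```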

Page 53:

DRAM Bank Operation

[Figure: DRAM bank operation. Access (Row 0, Column 0): the row buffer is empty, so Row 0 is activated into the row buffer via the row decoder, then Column 0 is selected by the column mux. Accesses (Row 0, Column 1) and (Row 0, Column 85): row-buffer HITs. Access (Row 1, Column 0): CONFLICT — the bank must close Row 0 and activate Row 1 before reading Column 0.]

Page 54:

The DRAM Chip

• Consists of multiple banks (2-16 in Synchronous DRAM)

• Banks share command/address/data buses

• The chip itself has a narrow interface (4-16 bits per read)


Page 55:

128M x 8-bit DRAM Chip


Page 56:

DRAM Rank and Module

• Rank: Multiple chips operated together to form a wide interface

• All chips comprising a rank are controlled at the same time

❖ Respond to a single command

❖ Share address and command buses, but provide different data

❖ Like DRAM “SIMD”

• A DRAM module consists of one or more ranks

❖ E.g., DIMM (dual inline memory module)

❖ This is what you plug into your motherboard

• If we have chips with 8-bit interface, to read 8 bytes in a single access, use 8 chips in a DIMM

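The last bullet is just arithmetic, worked out below (names are illustrative):

```python
# A rank gangs chips with narrow interfaces into one wide data bus:
# eight 8-bit chips fill a 64-bit bus, so one access reads 8 bytes.

chip_width_bits = 8
bus_width_bits = 64

chips_per_rank = bus_width_bits // chip_width_bits   # 8 chips on the DIMM
bytes_per_access = bus_width_bits // 8               # one access reads 8B
print(chips_per_rank, bytes_per_access)              # -> 8 8
```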

Page 57:

A 64-bit Wide DIMM (One Rank)

[Figure: A 64-bit wide DIMM (one rank): eight DRAM chips share the command bus and together drive the 64-bit data bus.]

Page 58:

Multiple DIMMs


• Advantages:

❖ Enables even higher capacity

• Disadvantages:

❖ Interconnect complexity and energy consumption can be high

Page 59:

DRAM Channels

• 2 Independent Channels: 2 memory controllers (above)

• 2 Dependent/Lockstep Channels: 1 memory controller with a wide interface (not shown above)

Page 60:

Generalized Memory Structure


Page 61:

Generalized Memory Structure


Page 62:

The DRAM Subsystem

The Top Down View

Page 63:

DRAM Subsystem Organization

• Channel

• DIMM

• Rank

• Chip

• Bank

• Row/Column


Page 64:

The DRAM subsystem

[Figure: A processor with two memory channels, each connecting to a DIMM (dual in-line memory module).]

Page 65:

Breaking down a DIMM

DIMM (Dual in-line memory module)

Side view

Front of DIMM Back of DIMM

Page 66:

Breaking down a DIMM

DIMM (Dual in-line memory module)

Side view

Front of DIMM Back of DIMM

Rank 0: collection of 8 chips Rank 1

Page 67:

Rank

[Figure: Rank 0 (front of DIMM) and Rank 1 (back) share the memory channel — Data <0:63> and Addr/Cmd — with chip-select CS <0:1> choosing the rank.]

Page 68:

Breaking down a Rank

[Figure: Rank 0 = Chip 0 … Chip 7. Chip 0 drives Data <0:7>, Chip 1 drives <8:15>, …, Chip 7 drives <56:63>, together forming Data <0:63>.]

Page 69:

Breaking down a Chip

[Figure: Chip 0 contains multiple banks (Bank 0, …), all sharing the chip's 8-bit data interface <0:7>.]

Page 70:

Breaking down a Bank

[Figure: a bank is a 2D array of 16K rows (row 0 ... row 16k-1), each row 2kB wide and divided into 1B columns. An activated row is latched into the row buffer, from which individual 1B columns are read out over the chip's 8-bit interface.]

Page 71:

DRAM Subsystem Organization

• Channel

• DIMM

• Rank

• Chip

• Bank

• Row/Column

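The hierarchy above determines capacity multiplicatively. A quick sanity check in Python, using illustrative parameters (8 banks per rank, 16K rows and 2K columns per bank, and an 8-byte-wide column matching the 64-bit channel):

```python
def dram_capacity_bytes(banks=8, rows=16 * 1024, cols=2 * 1024, col_bytes=8):
    # Each bank is a rows x cols grid of col_bytes-wide locations;
    # rank capacity is the product across all banks.
    return banks * rows * cols * col_bytes

# 8 banks * 16Ki rows * 2Ki columns * 8 B/column = 2 GiB
print(dram_capacity_bytes())  # 2147483648
```

These are the same numbers used in the single-channel address-mapping example later in the lecture.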

Page 72:

Example: Transferring a cache block

[Figure: the physical memory space (0x00 ... 0xFFFF...F) is shown next to the DRAM hierarchy. A 64B cache block (addresses 0x00-0x40) maps to Channel 0, DIMM 0, Rank 0.]

Page 73:

Example: Transferring a cache block

[Figure: within Rank 0, the block is striped across Chips 0-7, which drive Data <0:7>, <8:15>, ..., <56:63> respectively.]

Page 74:

Example: Transferring a cache block

[Figure: each chip activates Row 0 and reads Col 0.]

Page 75:

Example: Transferring a cache block

[Figure: the 8 chips each supply 1B from Row 0, Col 0, together delivering the first 8B of the block over the 64-bit bus.]

Page 76:

Example: Transferring a cache block

[Figure: the chips move on to Row 0, Col 1.]

Page 77:

Example: Transferring a cache block

[Figure: Col 1 supplies the next 8B of the block.]

Page 78:

Example: Transferring a cache block

A 64B cache block takes 8 I/O cycles to transfer. During the process, 8 columns are read sequentially.
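The 8-cycle count above falls out of simple arithmetic, sketched here (parameter names are illustrative):

```python
def io_cycles(block_bytes=64, chips=8, bytes_per_chip_per_cycle=1):
    # Each I/O cycle, every chip in the rank drives its slice of the
    # 64-bit bus: 1B per chip x 8 chips = 8B per cycle.
    bytes_per_cycle = chips * bytes_per_chip_per_cycle
    return block_bytes // bytes_per_cycle

print(io_cycles())  # 8 -- one column read per cycle
```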

Page 79:

Latency Components: Basic DRAM Operation

• CPU → controller transfer time

• Controller latency

❖ Queuing & scheduling delay at the controller

❖ Access converted to basic commands

• Controller → DRAM transfer time

• DRAM bank latency

❖ Simple CAS if row is “open” OR

❖ RAS + CAS if array precharged OR

❖ PRE + RAS + CAS (worst case)

• DRAM → CPU transfer time (through controller)

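The three row-buffer cases above can be made concrete with a toy latency model; the timing values (tRP, tRCD, CL are standard DDR timing-parameter names) are illustrative, not from any specific datasheet:

```python
# Illustrative timings in ns: precharge, row activate, column access
tRP, tRCD, CL = 15, 15, 15

def bank_latency_ns(state):
    if state == "row_hit":       # row already open: CAS only
        return CL
    if state == "closed":        # array precharged: RAS (activate) + CAS
        return tRCD + CL
    if state == "row_conflict":  # wrong row open: PRE + RAS + CAS
        return tRP + tRCD + CL
    raise ValueError(state)

print(bank_latency_ns("row_hit"), bank_latency_ns("closed"),
      bank_latency_ns("row_conflict"))  # 15 30 45
```

The 3x spread between a row hit and a row conflict is why schedulers care so much about row-buffer locality.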

Page 80:

Multiple Banks (Interleaving) and Channels

• Multiple banks

❖ Enable concurrent DRAM accesses

❖ Bits in address determine which bank an address resides in

• Multiple independent channels serve the same purpose

❖ But they are even better because they have separate data buses

❖ Increased bus bandwidth

• Enabling more concurrency requires reducing

❖ Bank conflicts

❖ Channel conflicts

• How to select/randomize bank/channel indices in address?

❖ Lower order bits have more entropy

❖ Randomizing hash functions (XOR of different address bits)


Page 81:

How Multiple Banks/Channels Help

[Figure: timeline showing accesses to different banks (or channels) overlapping in time, versus serialized accesses to a single bank.]
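A toy model makes the benefit concrete: requests to the same bank serialize, while requests to different banks overlap. This is a deliberate simplification (it ignores bus contention and timing constraints):

```python
from collections import Counter

def total_latency(requests, bank_latency=50):
    """Toy model: `requests` is a list of bank ids. Same-bank requests
    serialize; different-bank requests overlap completely."""
    per_bank = Counter(requests)
    return max(per_bank.values()) * bank_latency

# Four requests hitting one bank vs. spread over four banks:
print(total_latency([0, 0, 0, 0]))  # 200 -- fully serialized
print(total_latency([0, 1, 2, 3]))  # 50  -- fully overlapped
```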

Page 82:

Multiple Channels

• Advantages

❖ Increased bandwidth

❖ Multiple concurrent accesses (if independent channels)

• Disadvantages

❖ Higher cost than a single channel

o More board wires

o More pins (if on-chip memory controller)


Page 83:

Address Mapping (Single Channel)

• Single-channel system with 8-byte memory bus

❖ 2GB memory, 8 banks, 16K rows & 2K columns per bank

• Row interleaving

❖ Consecutive rows of memory in consecutive banks

• Cache block interleaving

o Consecutive cache block addresses in consecutive banks

o 64 byte cache blocks

o Accesses to consecutive cache blocks can be serviced in parallel

o How about random accesses? Strided accesses?

Row interleaving layout: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)

Cache block interleaving layout: Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits)
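The row-interleaved layout can be decoded with a few shifts and masks. A sketch following the single-channel field positions above (3b byte-in-bus, 11b column, 3b bank, 14b row, low to high):

```python
def decode_row_interleaved(addr):
    byte = addr & 0x7            # bits 0-2:  byte in bus
    col  = (addr >> 3) & 0x7FF   # bits 3-13: column
    bank = (addr >> 14) & 0x7    # bits 14-16: bank
    row  = (addr >> 17) & 0x3FFF # bits 17-30: row
    return row, bank, col, byte

# Under row interleaving, two consecutive 64B cache blocks land in the
# SAME bank (the bank bits sit above the full column field):
assert decode_row_interleaved(0x00)[1] == decode_row_interleaved(0x40)[1]
```

Under cache-block interleaving the bank bits would sit just above the low column bits instead, so consecutive blocks would alternate banks.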

Page 84:

Bank Mapping Randomization

• DRAM controller can randomize the address mapping to banks so that bank conflicts are less likely

[Figure: the 3 bank-index bits are produced by XORing the original 3 bank bits of the address (sitting above the Column (11 bits) and Byte in bus (3 bits) fields) with 3 other address bits.]
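A minimal sketch of such a hash, assuming the row-interleaved layout (bank bits at 14-16); which higher-order bits feed the XOR is an illustrative choice:

```python
def bank_index(addr):
    # XOR the 3 original bank bits with 3 low-order row bits (bits 17-19).
    bank_bits = (addr >> 14) & 0x7
    row_bits = (addr >> 17) & 0x7
    return bank_bits ^ row_bits

# A row-strided pattern that previously hammered one bank now spreads
# across all banks as the row index changes:
banks = {bank_index(row << 17) for row in range(8)}
print(sorted(banks))  # all 8 bank indices appear
```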

Page 85:

Address Mapping (Multiple Channels)

• Where are consecutive cache blocks?

[Figure: the two layouts from the single-channel example, each shown with a channel bit C inserted at different positions:

Row interleaving: Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits), with C placed at various positions in the address.

Cache block interleaving: Row (14 bits) | High Column (8 bits) | Bank (3 bits) | Low Col. (3 bits) | Byte in bus (3 bits), again with C at various positions.

Where C sits determines whether consecutive cache blocks stay in one channel or alternate between channels.]

Page 86:

Interaction with Virtual→Physical Mapping

• The operating system influences where an address maps to in DRAM

• The OS can control which bank/channel/rank a virtual page is mapped to

• It can perform page coloring to minimize bank conflicts

• Or to minimize inter-application interference

VA: Virtual Page number (52 bits) | Page offset (12 bits)

PA: Physical Frame number (19 bits) | Page offset (12 bits)

PA (DRAM view): Row (14 bits) | Bank (3 bits) | Column (11 bits) | Byte in bus (3 bits)
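With 4KB pages the bank bits (PA bits 14-16 in the row-interleaved layout) lie entirely inside the frame number, so the OS controls them by choosing frames. A page-coloring sketch under those assumptions (the helper names are illustrative):

```python
PAGE_OFFSET_BITS = 12  # 4KB pages

def bank_of_frame(frame_number):
    # Bank bits are PA bits 14-16, i.e. bits 2-4 of the frame number.
    pa = frame_number << PAGE_OFFSET_BITS
    return (pa >> 14) & 0x7

def pick_frame(free_frames, wanted_bank):
    # Page coloring: choose a free frame that maps to the desired bank.
    for f in free_frames:
        if bank_of_frame(f) == wanted_bank:
            return f
    return None

f = pick_frame(range(64), wanted_bank=5)
print(f, bank_of_frame(f))  # a frame whose pages land in bank 5
```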

Page 87:

DRAM Refresh (I)

• DRAM capacitor charge leaks over time

• The memory controller needs to read each row periodically to restore the charge

❖ Activate + precharge each row every N ms

❖ Typical N = 64 ms

• Implications on performance?

-- DRAM bank unavailable while refreshed

-- Long pause times: If we refresh all rows in burst, every 64ms the DRAM will be unavailable until refresh ends

• Burst refresh: All rows refreshed immediately after one another

• Distributed refresh: Each row refreshed at a different time, at regular intervals

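The distributed-refresh interval follows from the numbers above. Assuming 8192 rows refreshed every 64 ms (a common DDR3-era configuration, used here for illustration):

```python
ROWS, PERIOD_MS = 8192, 64

def distributed_refresh_interval_us():
    # One row refresh every PERIOD/ROWS: 64 ms / 8192 rows
    return PERIOD_MS * 1000 / ROWS

print(distributed_refresh_interval_us())  # 7.8125 us between row refreshes
```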

Page 88:

DRAM Refresh (II)

• Distributed refresh eliminates long pause times


Page 89:

Downsides of DRAM Refresh

-- Energy consumption: Each refresh consumes energy

-- Performance degradation: DRAM rank/bank unavailable while refreshed

-- QoS/predictability impact: (Long) pause times during refresh

Page 90:

Back to the paper…

Page 91:

CTA Data Layout (A Simple Example)

[Figure: a 4x4 data matrix A is tiled into four CTAs (CTA 1 and CTA 2 cover the top half, CTA 3 and CTA 4 the bottom half). With a row-major DRAM data layout, the matrix row starting at A(0,0) maps to Bank 1, the row at A(1,0) to Bank 2, the row at A(2,0) to Bank 3, and the row at A(3,0) to Bank 4, so consecutive CTAs touch the same DRAM rows.]

Average percentage of consecutive CTAs (out of total CTAs) accessing the same row = 64%

Courtesy A. Jog et al., “OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance,” ASPLOS 2013

Page 92:

Implications of high CTA-row sharing

[Figure: SIMT Core-1 runs CTA-1 and CTA-3, SIMT Core-2 runs CTA-2 and CTA-4; their warps issue through the L2 cache to DRAM Banks 1-4, which hold Rows 1-4. Because both cores use the same CTA prioritization order, requests concentrate on a subset of rows and banks, leaving the remaining banks idle.]

Page 93:

High Row Locality vs. Bank-Level Parallelism

[Figure, left: requests queue up behind Row-1 in Bank-1 and Row-2 in Bank-2, giving high row locality but low bank-level parallelism. Right: the same requests spread across both banks give lower row locality but higher bank-level parallelism.]

Page 94:

Some Additional Details

• Spread references from multiple CTAs (on multiple SMs) across row buffers in the distinct banks

• Do not use the same CTA group prioritization across SMs

❖ Play the odds

• What happens with applications with unstructured, irregular memory access patterns?

Page 95:

Objective 2: Improving Bank-Level Parallelism

[Figure: as before, SIMT Core-1 (CTA-1, CTA-3) and SIMT Core-2 (CTA-2, CTA-4) issue warps through the L2 cache to Banks 1-4, but the two cores now use different CTA prioritization orders, spreading requests across banks.]

11% increase in bank-level parallelism

14% decrease in row buffer locality

Page 96:

Objective 3: Recovering Row Locality

[Figure: memory-side prefetching pulls the remaining lines of an open DRAM row into the L2 cache, so later warps from SIMT Core-1 and SIMT Core-2 hit in the L2 instead of reopening rows in Banks 1-4.]

Page 97:

Memory-Side Prefetching

◼ Prefetch the so-far-unfetched cache lines in an already open row into the L2 cache, just before the row is closed

◼ What to prefetch?

❑ Sequentially prefetch the cache lines that were not accessed by demand requests

❑ Sophisticated schemes are left as future work

◼ When to prefetch?

❑ Opportunistic in nature

❑ Option 1: Prefetching stops as soon as a demand request arrives for another row (demands are always critical)

❑ Option 2: Give more time for prefetching; make demands wait if there are not many (demands are NOT always critical)

Page 98:

IPC Results (Normalized to Round-Robin)

[Figure: normalized IPC across benchmarks (SAD, PVC, SSC, BFS, MUM, CFD, KMN, SCP, FWT, IIX, SPMV, JPEG, BFSR, SC, FFT, SD2, WP, PVR, BP) and the average (AVG-T1), comparing Objective 1, Objective (1+2), Objective (1+2+3), and a Perfect L2. Average speedups over round-robin: 25% (Objective 1), 31% (Objectives 1+2), 33% (Objectives 1+2+3); Perfect-L2 achieves 44%.]

◼ Objectives (1+2+3) come within 11% of Perfect L2

Page 99:

Summary

• Coordinated scheduling across SMs, CTAs, and warps

• Consideration of effects deeper in the memory system

• Coordinating warp residence in the core with the presence of corresponding lines in the cache

Page 100:

CAWA: Coordinated Warp Scheduling and Cache Prioritization for Critical Warp Acceleration in GPGPU Workloads

S.-Y. Lee, A. Arunkumar, and C.-J. Wu, ISCA 2015

Goal

• Reduce warp divergence and hence increase throughput

• The key is the identification of critical (lagging) warps

• Manage resources and scheduling decisions to speed up the execution of critical warps, thereby reducing divergence

Page 101:

Review: Resource Limits on Occupancy

[Figure: the kernel distributor hands thread blocks to SMs (stream multiprocessors) via the SM scheduler, with DRAM below. Within an SM: warp schedulers and warp contexts limit the number of threads; thread block control state limits the number of thread blocks; the register file limits the number of threads; and L1/shared memory limits the number of thread blocks, with locality effects. SP = stream processor.]

Page 102:

Evolution of Warps in a TB

• Coupled lifetimes of warps in a TB

❖ Start at the same time

❖ Synchronization barriers

❖ Kernel exit (implicit synchronization barrier)

[Figure: warps in a TB complete at different times; the tail region, where only a few warps remain, is where latency hiding is less effective.]

Figure from P. Xiang et al., “Warp-Level Divergence in GPUs: Characterization, Impact, and Mitigation”

Page 103:

Warp Criticality Problem

[Figure: host CPU and GPU joined by an interconnection bus; the kernel management unit (HW work queues, pending kernels) and kernel distributor feed SMXs, each with cores, registers, L1 cache / shared memory, warp schedulers, and warp contexts; a memory controller connects the L2 cache and DRAM. Annotations: available registers represent spatial underutilization; registers still allocated to completed warps in the TB represent temporal underutilization; completed (idle) warp contexts sit waiting on the last warp.]

Manage resources and schedules around critical warps

Temporal & spatial underutilization

Page 104:

The Warp Criticality Problem

• Significant warp execution disparity for warps in the same thread block

[Figure, left: execution time of warps w0-w15 in a breadth-first-search thread block, sorted by criticality and normalized to the fastest warp (ranging up to ~1.5x). Right: average execution time disparity, split into computation and idle time, for bfs, b+tree, heartwall, kmeans, needle, srad_1, and strcltr_small, with callouts at 100% and 40%.]

*Lee and Wu, “CAWS: Criticality-Aware Warp Scheduling for GPGPU Workloads,” PACT 2014

Page 105:

Research Questions

• What is the source of warp criticality?

• How can we effectively accelerate critical warp execution?


Page 106:

Source of Warp Criticality

• Workload Imbalance

• Diverging Branch Behavior

• Memory Contention and Memory Access Latency

• Execution Order of Warp Scheduling


Page 107:

Workload Imbalance & Diverging Branch

• Workload imbalance or diverging branch behavior makes warps execute different numbers of dynamic instructions.

[Figure: per-warp execution time and dynamic instruction count for w0-w15, each normalized to the fastest warp and sorted by criticality; the two closely track each other.]

Page 108:

Memory Contention

• While warps experience different latencies to access memory, memory contention can induce warp criticality.

[Figure: execution time of warps w0-w15, normalized to the fastest warp and sorted by criticality (up to ~2.5x), broken into total memory system latency and other latency.]

Page 109:

Warp Scheduling Order

• The warp scheduler may introduce additional stall cycles for a ready warp, resulting in warp criticality.

[Figure: execution time of warps w0-w15, normalized to the fastest warp and sorted by criticality (up to ~2.5x), broken into scheduler latency and other latency.]

Page 110:

CAWA: Criticality-Aware Warp Acceleration

A coordinated warp scheduling and cache prioritization design:

• Criticality Prediction Logic (CPL)

– Predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS)

– Prioritizing and accelerating the critical warp execution

• Criticality-Aware Cache Prioritization (CACP)

– Prioritizing and allocating cache lines for critical warp reuse

Page 111:

CAWA: Criticality-Aware Warp Acceleration

• Criticality Prediction Logic (CPL)

– Predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS)

– Prioritizing and accelerating the critical warp execution

• Criticality-Aware Cache Prioritization (CACP)

– Prioritizing and allocating cache lines for critical warp reuse

Page 112:

CAWA CPL: Criticality Prediction Logic

• Estimates the number of additional cycles a warp may still experience:

Criticality = nInst × CPIavg + nStall

• nInst (remaining instruction count) captures instruction count disparity from workload imbalance and diverging branches; it is decremented whenever an instruction is executed

• nStall captures memory latency and scheduling latency
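The CPL estimate above can be sketched as a small bookkeeping class; the field names and update granularity are illustrative, not the paper's exact hardware:

```python
class WarpCriticality:
    """Sketch of CPL: Criticality = nInst * CPI_avg + nStall."""

    def __init__(self, n_inst, cpi_avg=1.0):
        self.n_inst = n_inst    # predicted remaining instructions
        self.cpi_avg = cpi_avg  # warp's average CPI observed so far
        self.n_stall = 0        # accumulated stall cycles

    def retire(self, insts=1):
        self.n_inst -= insts    # decremented as instructions execute

    def stall(self, cycles):
        self.n_stall += cycles  # memory + scheduler stall cycles

    def criticality(self):
        return self.n_inst * self.cpi_avg + self.n_stall

w = WarpCriticality(n_inst=100, cpi_avg=2.0)
w.stall(40)
w.retire(10)
print(w.criticality())  # 90 * 2.0 + 40 = 220.0
```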

Page 113:

CAWA: Criticality-Aware Warp Acceleration

• Criticality Prediction Logic (CPL)

– Predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS)

– Prioritizing and accelerating the critical warp execution

• Criticality-Aware Cache Prioritization (CACP)

– Prioritizing and allocating cache lines for critical warp reuse

Reduce delay from the scheduler

Page 114:

CAWA gCAWS: greedy Criticality-Aware Warp Scheduler

• Prioritizes warps based on their criticality given by CPL

• Executes warps in a greedy* manner

– Select the most critical ready warp

– Keep executing the selected warp until it stalls

Warp pool criticalities: Warp 0 = 5, Warp 1 = 10, Warp 2 = 3, Warp 3 = 7

Warp scheduler selection sequence:

• Traditional approach (e.g., RR, 2L, GTO): W0 → W1 → W2 → W3

• gCAWS: W1 → W3 → W0 → W2

*Rogers et al., “Cache-Conscious Wavefront Scheduling,” MICRO 2012
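The selection sequence above can be reproduced with a toy greedy picker. The warp pool and criticality values mirror the slide's example; the dict-based pool structure is an illustrative stand-in for the hardware scheduler state:

```python
def gcaws_pick(warp_pool):
    """Select the ready warp with the highest predicted criticality.

    warp_pool maps warp id -> (criticality, ready).
    """
    ready = {w: c for w, (c, ok) in warp_pool.items() if ok}
    return max(ready, key=ready.get) if ready else None

pool = {"W0": (5, True), "W1": (10, True), "W2": (3, True), "W3": (7, True)}
order = []
while pool:
    w = gcaws_pick(pool)
    order.append(w)
    del pool[w]  # pretend the selected warp runs until it finishes/stalls

print(order)  # ['W1', 'W3', 'W0', 'W2']
```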

Page 115:

CAWA: Criticality-Aware Warp Acceleration

• Criticality Prediction Logic (CPL)

– Predicting and identifying the critical warp at runtime

• greedy Criticality-Aware Warp Scheduler (gCAWS)

– Prioritizing and accelerating the critical warp execution

• Criticality-Aware Cache Prioritization (CACP)

– Prioritizing and allocating cache lines for critical warp reuse

Reduce delay from the memory

Page 116:

CAWA CACP: Criticality-Aware Cache Prioritization

*Wu et al., “SHiP: Signature-based Hit Predictor for High Performance Caching,” MICRO 2011

Page 117:

CAWA CACP: Criticality-Aware Cache Prioritization

[Figure: flow of a cache request through CACP.]

Page 118:

Overall Performance Improvement

[Figure: normalized MPKI (top) and normalized IPC (bottom) for CAWA against the 2L and GTO schedulers, with callouts of 3.13x and 2.54x.]

CAWA aims to retain data for critical warps

Improving GPU performance via large warps and two-level warp scheduling. Narasiman et al. MICRO 2011.

Cache conscious wavefront scheduling. Rogers, O'Connor, and Aamodt. MICRO 2012.

Page 119:

gCAWS Performance Improvement

[Figure: IPC normalized to the baseline RR scheduler for gCAWS and CAWA, with callouts of 2.63x and 3.13x; the gap between the two appears on workloads with a large memory footprint.]

Page 120:

Performance Improvement with CAWA CACP

[Figure: MPKI normalized to the baseline RR scheduler for RR+CACP, 2L+CACP, GTO+CACP, and CAWA.]

• CACP can effectively reduce cache interference with any scheduler

• Schedulers need modification for robust and higher performance improvement with CACP

Page 121:

Summary

• Warp divergence leads to some lagging warps → critical warps

• Expose the performance impact of critical warps → throughput reduction

• Coordinate scheduler and cache management to reduce warp divergence