Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
GPU Architecture (based on Tesla Architecture)
• SM – Streaming Multiprocessor
• SP – Scalar Processor
• SIMT – Single Instruction, Multiple Thread
SM Architecture (based on Tesla Architecture)
• Fetch Mechanism
– Fetch 1 instruction for the selected warp
– Stall fetch for a warp when it executes a Load/Store or when it encounters a Branch
• Scheduler Policy
– Oldest first and in-order (within a warp)
• Caches
– I-Cache, Shared Memory, Constant Cache, and Texture Cache
Handling Multiple Memory Requests
• MSHR/Memory Request Queue
– Allows merging of memory requests (intra-core)
• DRAM Controller
– Allows merging of memory requests (inter-core)
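The intra-core case above can be sketched in a few lines: an MSHR file tracks outstanding misses by cache-block address, and a new request to an already-pending block is merged instead of generating a second DRAM request. This is a minimal illustrative model, not the simulator's code; the 64-byte block size and the class/method names are assumptions.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)

class MSHRFile:
    """Toy MSHR file: one entry per outstanding cache-block miss."""
    def __init__(self):
        self.entries = {}  # block address -> list of requesting warp ids

    def request(self, warp_id, addr):
        block = addr // BLOCK_SIZE
        if block in self.entries:
            # Same block already pending: merge, no new DRAM request.
            self.entries[block].append(warp_id)
            return "merged"
        # First miss to this block: allocate an entry, send one request.
        self.entries[block] = [warp_id]
        return "miss"

mshrs = MSHRFile()
print(mshrs.request(0, 0x100))  # miss
print(mshrs.request(1, 0x120))  # merged (same 64B block as 0x100)
print(mshrs.request(2, 0x200))  # miss
```

The DRAM controller performs the inter-core analogue of the same idea: requests from different SMs to the same block can be combined before they reach the banks.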
Intra-core Merging
Code Example – Intra-Core Merging
• From MonteCarlo in the CUDA SDK

```c
for (iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
    ...
    for (int i = iSum; i < pathN; i += SUM_N) {
        real r = d_Samples[i];
        real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
        sumCall.Expected += callValue;
        sumCall.Confidence += callValue * callValue;
    }
    ...
}
```

Notation: A(X, Y), where X is the block id and Y is the thread id.
• iSum(0, 2) = iSum(1, 2) = iSum(2, 2) = 2
• i(0, 2) = i(1, 2) = i(2, 2) = 2
• r(0, 2) = r(1, 2) = r(2, 2) = d_Samples[2]
• Multiple blocks are assigned to the same SM
• Threads with corresponding ids in different blocks access the same memory locations, so their requests can be merged
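The pattern above follows from the loop bounds: neither loop depends on the block id, so a given thread id walks the same index sequence in every block. A small sketch makes this concrete; the parameter values (SUM_N, pathN, blockDim.x) are illustrative assumptions, not the SDK's actual configuration.

```python
# Illustrative values only; the real MonteCarlo kernel uses its own sizes.
SUM_N, PATH_N, BLOCK_DIM = 128, 256, 64

def indices(thread_id):
    """Indices into d_Samples touched by one thread of the loop above.
    Note: the block id never appears, so every block's thread with this
    id reads exactly the same elements."""
    out = []
    for i_sum in range(thread_id, SUM_N, BLOCK_DIM):
        for i in range(i_sum, PATH_N, SUM_N):
            out.append(i)
    return out

print(indices(2))  # [2, 130, 66, 194] -- identical for thread 2 of every block
```

Because blocks co-resident on one SM issue these identical accesses close together in time, the MSHRs can merge them into single DRAM requests.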
Inter-core Merging
Why look at Fetch?
• Allows implicit control over the resources allocated to a warp
• Can control the progress of a warp
• Can boost performance by fetching more for critical warps
• Implicit resource control within a core
Why look at DRAM Scheduling?
• The memory system is a performance bottleneck for several applications
• DRAM scheduling decides the order in which memory requests are granted
• Can prioritize warps based on criticality
• Implicit performance control across cores
By controlling Fetch and DRAM Scheduling we can control performance
How is This Useful?
• Understand applications and their behavior better
• Detect patterns or behavioral groups across applications
• Design new policies for GPGPU applications to improve performance
Fetch Policies
• Round Robin (RR) [default in the Tesla architecture]
• FAIR
– Ensures uniform progress of all warps
• ICOUNT [Tullsen'96]
– Same as ICOUNT in SMT
– Tries to increase throughput by giving priority to fast-moving threads
• Least Recently Fetched (LRF)
– Prevents starvation of warps
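The four policies above differ only in which ready warp the frontend picks each cycle, so they can be sketched as one selection function. This is a hypothetical model for illustration — the `Warp` fields and `select` function are assumptions, not the paper's simulator code.

```python
from dataclasses import dataclass

@dataclass
class Warp:
    wid: int
    last_fetch_cycle: int = 0   # used by LRF
    inflight_insts: int = 0     # used by ICOUNT
    fetched_insts: int = 0      # used by FAIR (progress so far)

def select(policy, warps, cycle):
    """Pick the warp to fetch for this cycle under the given policy."""
    if policy == "RR":       # rotate over warps round-robin
        return warps[cycle % len(warps)]
    if policy == "LRF":      # least recently fetched warp first
        return min(warps, key=lambda w: w.last_fetch_cycle)
    if policy == "ICOUNT":   # fewest instructions in the pipeline first
        return min(warps, key=lambda w: w.inflight_insts)
    if policy == "FAIR":     # least-progressed warp first (uniform progress)
        return min(warps, key=lambda w: w.fetched_insts)
    raise ValueError(policy)

warps = [Warp(0, fetched_insts=10), Warp(1, fetched_insts=4), Warp(2, fetched_insts=7)]
print(select("FAIR", warps, cycle=0).wid)  # 1 (the warp with least progress)
```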
New Oracle-Based Fetch Policies
• ALL
– Gives priority to longer warps (total length until termination)
– Ensures all warps finish at the same time, which results in higher occupancy

Priorities: warp 0 > warp 1 > warp 2 > warp 3
• BAR
– Gives priority to warps with a greater number of instructions to the next barrier
– The idea is to reduce wait time at barriers

Priorities: warp 0 > warp 1 > warp 2 > warp 3
Priorities: warp 2 > warp 1 > warp 0 > warp 3
• MEM_BAR
– Similar to BAR, but gives higher priority to warps with more memory instructions

Priorities: warp 0 > warp 2 > warp 1 = warp 3
Priorities: warp 1 > warp 0 = warp 2 > warp 3

Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb), or if MemInst(Wa) = MemInst(Wb) and Inst(Wa) > Inst(Wb)
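The three oracle policies reduce to sort keys over per-warp instruction counts (which an oracle supplies ahead of time). A minimal sketch, with illustrative field names and counts that are not taken from the paper:

```python
def all_key(w):      # ALL: more total remaining instructions -> higher priority
    return w["remaining_insts"]

def bar_key(w):      # BAR: more instructions until the next barrier -> higher
    return w["insts_to_barrier"]

def mem_bar_key(w):  # MEM_BAR: more memory instructions first; ties broken
    return (w["mem_insts_to_barrier"], w["insts_to_barrier"])  # by instruction count

warps = [
    {"wid": 0, "remaining_insts": 90, "insts_to_barrier": 30, "mem_insts_to_barrier": 4},
    {"wid": 1, "remaining_insts": 60, "insts_to_barrier": 50, "mem_insts_to_barrier": 9},
    {"wid": 2, "remaining_insts": 80, "insts_to_barrier": 20, "mem_insts_to_barrier": 4},
]
order = sorted(warps, key=mem_bar_key, reverse=True)
print([w["wid"] for w in order])  # [1, 0, 2]
```

MEM_BAR's tuple key implements exactly the rule above: compare memory-instruction counts first, and fall back to total instruction counts on a tie.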
DRAM Scheduling Policies
• FCFS
• FRFCFS [Rixner'00]
• FR_FAIR (new policy)
– Row hit with fairness
– Ensures uniform progress of warps
• REM_INST (new Oracle-based policy)
– Row hit with priority for warps with a greater number of instructions remaining until termination
– Prioritizes longer warps
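For reference, the FRFCFS baseline these policies build on picks, among queued requests, a row-buffer hit if one exists, and otherwise the oldest request; ties are broken by arrival order. A minimal sketch, with queue entries as assumed `(arrival, row)` tuples:

```python
def frfcfs_pick(queue, open_row):
    """First-Ready FCFS: prefer requests hitting the open row, then oldest."""
    hits = [r for r in queue if r[1] == open_row]
    pool = hits if hits else queue          # no row hit: fall back to all requests
    choice = min(pool, key=lambda r: r[0])  # FCFS (oldest arrival) within the pool
    queue.remove(choice)
    return choice

q = [(0, "rowA"), (1, "rowB"), (2, "rowB"), (3, "rowA")]
print(frfcfs_pick(q, open_row="rowB"))  # (1, 'rowB') -- oldest row hit wins
print(frfcfs_pick(q, open_row="rowB"))  # (2, 'rowB')
```

FR_FAIR and REM_INST keep the row-hit-first step but replace the FCFS tie-break with warp progress and remaining-instruction counts, respectively.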
Experimental Setup
• Simulated GPU Architecture
– 8 SMs
– Frontend: 1-wide, 1KB I-Cache, branch stall
– Execution: 8-wide SIMD execution unit, in-order scheduling, 4-cycle latency for most instructions
– Caches: 64KB software-managed cache, 8 load accesses/cycle
– Memory: 32B-wide bus, 8 DRAM banks
– RR fetch, FRFCFS DRAM scheduling (baseline)
• Trace-driven, cycle-accurate simulator
• Per-warp traces generated using GPU Ocelot [Kerr'09]
Benchmarks
• Taken from
– CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd
– PARBOIL [UIUC'09] – MRI-Q, MRI-FHD, CP, PNS
– RODINIA [Che'09] – Leukocyte, Cell, Needle
• Classification based on lengths of warps
– Symmetric, if divergence <= 2%
– Asymmetric, otherwise (results included in the paper)
Results – Symmetric Applications

[Chart: execution duration under FRFCFS DRAM scheduling, normalized to the RR + FRFCFS baseline, for fetch policies ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR]

• Compute intensive – no variation with different fetch policies
• Memory bound – improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR

Baseline: RR + FRFCFS
Results – Symmetric Applications

[Chart: execution duration under FRFAIR DRAM scheduling, normalized to the RR + FRFCFS baseline, for fetch policies RR, ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR]

• On average, better than FRFCFS
• MersenneTwister shows a huge improvement
• The REM_INST DRAM policy performs similarly to FRFAIR

Baseline: RR + FRFCFS
Analysis: MonteCarlo

[Chart: normalized execution time and ratio of merged memory requests for MonteCarlo, under FRFCFS DRAM scheduling]

• Fairness-oriented fetch policies improve performance by increasing intra-core merging
Analysis: MersenneTwister

[Chart: normalized execution time and DRAM row-buffer hit ratio for MersenneTwister, for each fetch policy (RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM_BAR) under FRFCFS, FRFAIR, and REM_INST DRAM scheduling]

• Fair DRAM scheduling (FRFAIR, REM_INST) improves performance by increasing the DRAM row-buffer hit ratio

Baseline: RR + FRFCFS
Analysis: BlackScholes

[Chart: normalized execution time and MLP/100 for BlackScholes, under FRFCFS DRAM scheduling]

• Fairness-oriented fetch policies increase MLP
• Increased MLP + row-buffer hit ratio improves performance
Conclusion
• Compute intensive applications
– Fetch and DRAM scheduling do not matter
• Symmetric memory intensive applications
– Fairness-oriented fetch policies (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement
• MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%)
• Asymmetric memory intensive applications
– No correlation between performance and fetch and DRAM scheduling policies
THANK YOU!