Effect of Instruction Fetch and Memory Scheduling on GPU Performance
Nagesh B Lakshminarayana, Hyesoon Kim
Outline
• Background and Motivation
• Policies
• Experimental Setup
• Results
• Conclusion
GPU Architecture (based on Tesla Architecture)
• SM – Streaming Multiprocessor
• SP – Scalar Processor
• SIMT – Single Instruction, Multiple Thread
SM Architecture (based on Tesla Architecture)
• Fetch Mechanism
– Fetch 1 instruction for the selected warp
– Stall fetch for a warp when it executes a Load/Store or when it encounters a Branch
• Scheduler Policy
– Oldest first and in-order (within a warp)
• Caches
– I-Cache, Shared Memory, Constant Cache, and Texture Cache
Handling Multiple Memory Requests
• MSHR/Memory Request Queue
– Allows merging of memory requests (intra-core)
• DRAM Controller
– Allows merging of memory requests (inter-core)
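The intra-core case above can be sketched in a few lines: an MSHR file tracks outstanding misses by cache-block address, and a new request to an already-pending block is merged instead of generating a second DRAM request. This is a minimal illustrative model, not the simulator's code; the 64-byte block size and the class/method names are assumptions.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)

class MSHRFile:
    """Toy MSHR file: one entry per outstanding cache-block miss."""
    def __init__(self):
        self.entries = {}  # block address -> list of requesting warp ids

    def request(self, warp_id, addr):
        block = addr // BLOCK_SIZE
        if block in self.entries:
            # Same block already pending: merge, no new DRAM request.
            self.entries[block].append(warp_id)
            return "merged"
        # First miss to this block: allocate an entry, send one request.
        self.entries[block] = [warp_id]
        return "miss"

mshrs = MSHRFile()
print(mshrs.request(0, 0x100))  # miss
print(mshrs.request(1, 0x120))  # merged (same 64B block as 0x100)
print(mshrs.request(2, 0x200))  # miss
```

The DRAM controller performs the inter-core analogue of the same idea: requests from different SMs to the same block can be combined before they reach the banks.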
Intra-core Merging
Code Example – Intra-Core Merging
• From MonteCarlo in the CUDA SDK

```c
for (iSum = threadIdx.x; iSum < SUM_N; iSum += blockDim.x) {
    ...
    for (int i = iSum; i < pathN; i += SUM_N) {
        real r = d_Samples[i];
        real callValue = endCallValue(S, X, r, MuByT, VBySqrtT);
        sumCall.Expected += callValue;
        sumCall.Confidence += callValue * callValue;
    }
    ...
}
```

Notation: A(X, Y), where X is the block id and Y is the thread id.
• iSum(0, 2) = iSum(1, 2) = iSum(2, 2) = 2
• i(0, 2) = i(1, 2) = i(2, 2) = 2
• r(0, 2) = r(1, 2) = r(2, 2) = d_Samples[2]
• Multiple blocks are assigned to the same SM
• Threads with corresponding ids in different blocks access the same memory locations, so their requests can be merged
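The pattern above follows from the loop bounds: neither loop depends on the block id, so a given thread id walks the same index sequence in every block. A small sketch makes this concrete; the parameter values (SUM_N, pathN, blockDim.x) are illustrative assumptions, not the SDK's actual configuration.

```python
# Illustrative values only; the real MonteCarlo kernel uses its own sizes.
SUM_N, PATH_N, BLOCK_DIM = 128, 256, 64

def indices(thread_id):
    """Indices into d_Samples touched by one thread of the loop above.
    Note: the block id never appears, so every block's thread with this
    id reads exactly the same elements."""
    out = []
    for i_sum in range(thread_id, SUM_N, BLOCK_DIM):
        for i in range(i_sum, PATH_N, SUM_N):
            out.append(i)
    return out

print(indices(2))  # [2, 130, 66, 194] -- identical for thread 2 of every block
```

Because blocks co-resident on one SM issue these identical accesses close together in time, the MSHRs can merge them into single DRAM requests.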
Inter-core Merging
Why look at Fetch?
• Allows implicit control over the resources allocated to a warp
• Can control the progress of a warp
• Can boost performance by fetching more for critical warps
• Implicit resource control within a core
Why look at DRAM Scheduling?
• The memory system is a performance bottleneck for several applications
• DRAM scheduling decides the order in which memory requests are granted
• Can prioritize warps based on criticality
• Implicit performance control across cores
By controlling Fetch and DRAM Scheduling we can control performance
How is This Useful?
• Understand applications and their behavior better
• Detect patterns or behavioral groups across applications
• Design new policies for GPGPU applications to improve performance
Fetch Policies
• Round Robin (RR) [default in the Tesla architecture]
• FAIR
– Ensures uniform progress of all warps
• ICOUNT [Tullsen'96]
– Same as ICOUNT in SMT
– Tries to increase throughput by giving priority to fast-moving threads
• Least Recently Fetched (LRF)
– Prevents starvation of warps
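The four policies above differ only in which ready warp the frontend picks each cycle, so they can be sketched as one selection function. This is a hypothetical model for illustration — the `Warp` fields and `select` function are assumptions, not the paper's simulator code.

```python
from dataclasses import dataclass

@dataclass
class Warp:
    wid: int
    last_fetch_cycle: int = 0   # used by LRF
    inflight_insts: int = 0     # used by ICOUNT
    fetched_insts: int = 0      # used by FAIR (progress so far)

def select(policy, warps, cycle):
    """Pick the warp to fetch for this cycle under the given policy."""
    if policy == "RR":       # rotate over warps round-robin
        return warps[cycle % len(warps)]
    if policy == "LRF":      # least recently fetched warp first
        return min(warps, key=lambda w: w.last_fetch_cycle)
    if policy == "ICOUNT":   # fewest instructions in the pipeline first
        return min(warps, key=lambda w: w.inflight_insts)
    if policy == "FAIR":     # least-progressed warp first (uniform progress)
        return min(warps, key=lambda w: w.fetched_insts)
    raise ValueError(policy)

warps = [Warp(0, fetched_insts=10), Warp(1, fetched_insts=4), Warp(2, fetched_insts=7)]
print(select("FAIR", warps, cycle=0).wid)  # 1 (the warp with least progress)
```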
New Oracle-Based Fetch Policies
• ALL
– Gives priority to longer warps (total length until termination)
– Ensures all warps finish at the same time, which results in higher occupancy

Priorities: warp 0 > warp 1 > warp 2 > warp 3
• BAR
– Gives priority to warps with a greater number of instructions to the next barrier
– The idea is to reduce wait time at barriers

Priorities: warp 0 > warp 1 > warp 2 > warp 3
Priorities: warp 2 > warp 1 > warp 0 > warp 3
• MEM_BAR
– Similar to BAR, but gives higher priority to warps with more memory instructions

Priorities: warp 0 > warp 2 > warp 1 = warp 3
Priorities: warp 1 > warp 0 = warp 2 > warp 3

Priority(Wa) > Priority(Wb) if MemInst(Wa) > MemInst(Wb), or if MemInst(Wa) = MemInst(Wb) and Inst(Wa) > Inst(Wb)
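The three oracle policies reduce to sort keys over per-warp instruction counts (which an oracle supplies ahead of time). A minimal sketch, with illustrative field names and counts that are not taken from the paper:

```python
def all_key(w):      # ALL: more total remaining instructions -> higher priority
    return w["remaining_insts"]

def bar_key(w):      # BAR: more instructions until the next barrier -> higher
    return w["insts_to_barrier"]

def mem_bar_key(w):  # MEM_BAR: more memory instructions first; ties broken
    return (w["mem_insts_to_barrier"], w["insts_to_barrier"])  # by instruction count

warps = [
    {"wid": 0, "remaining_insts": 90, "insts_to_barrier": 30, "mem_insts_to_barrier": 4},
    {"wid": 1, "remaining_insts": 60, "insts_to_barrier": 50, "mem_insts_to_barrier": 9},
    {"wid": 2, "remaining_insts": 80, "insts_to_barrier": 20, "mem_insts_to_barrier": 4},
]
order = sorted(warps, key=mem_bar_key, reverse=True)
print([w["wid"] for w in order])  # [1, 0, 2]
```

MEM_BAR's tuple key implements exactly the rule above: compare memory-instruction counts first, and fall back to total instruction counts on a tie.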
DRAM Scheduling Policies
• FCFS
• FRFCFS [Rixner'00]
• FR_FAIR (new policy)
– Row hit with fairness
– Ensures uniform progress of warps
• REM_INST (new Oracle-based policy)
– Row hit with priority for warps with a greater number of instructions remaining until termination
– Prioritizes longer warps
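For reference, the FRFCFS baseline these policies build on picks, among queued requests, a row-buffer hit if one exists, and otherwise the oldest request; ties are broken by arrival order. A minimal sketch, with queue entries as assumed `(arrival, row)` tuples:

```python
def frfcfs_pick(queue, open_row):
    """First-Ready FCFS: prefer requests hitting the open row, then oldest."""
    hits = [r for r in queue if r[1] == open_row]
    pool = hits if hits else queue          # no row hit: fall back to all requests
    choice = min(pool, key=lambda r: r[0])  # FCFS (oldest arrival) within the pool
    queue.remove(choice)
    return choice

q = [(0, "rowA"), (1, "rowB"), (2, "rowB"), (3, "rowA")]
print(frfcfs_pick(q, open_row="rowB"))  # (1, 'rowB') -- oldest row hit wins
print(frfcfs_pick(q, open_row="rowB"))  # (2, 'rowB')
```

FR_FAIR and REM_INST keep the row-hit-first step but replace the FCFS tie-break with warp progress and remaining-instruction counts, respectively.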
Experimental Setup
• Simulated GPU Architecture
– 8 SMs
– Frontend: 1-wide, 1KB I-Cache, branch stall
– Execution: 8-wide SIMD execution unit, in-order scheduling, 4-cycle latency for most instructions
– Caches: 64KB software-managed cache, 8 load accesses/cycle
– Memory: 32B-wide bus, 8 DRAM banks
– RR fetch, FRFCFS DRAM scheduling (baseline)
• Trace-driven, cycle-accurate simulator
• Per-warp traces generated using GPU Ocelot [Kerr'09]
Benchmarks
• Taken from
– CUDA SDK 2.2 – MonteCarlo, Nbody, ScalarProd
– PARBOIL [UIUC'09] – MRI-Q, MRI-FHD, CP, PNS
– RODINIA [Che'09] – Leukocyte, Cell, Needle
• Classification based on lengths of warps
– Symmetric, if divergence <= 2%
– Asymmetric, otherwise (results included in the paper)
Results – Symmetric Applications

[Chart: execution duration under FRFCFS DRAM scheduling, normalized to the RR + FRFCFS baseline, for fetch policies ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR]

• Compute intensive – no variation with different fetch policies
• Memory bound – improvement with fairness-oriented fetch policies, i.e., FAIR, ALL, BAR, MEM_BAR

Baseline: RR + FRFCFS
Results – Symmetric Applications

[Chart: execution duration under FRFAIR DRAM scheduling, normalized to the RR + FRFCFS baseline, for fetch policies RR, ICOUNT, LRF, FAIR, ALL, BAR, and MEM_BAR]

• On average, better than FRFCFS
• MersenneTwister shows a huge improvement
• The REM_INST DRAM policy performs similarly to FRFAIR

Baseline: RR + FRFCFS
Analysis: MonteCarlo

[Chart: normalized execution time and ratio of merged memory requests for MonteCarlo, under FRFCFS DRAM scheduling]

• Fairness-oriented fetch policies improve performance by increasing intra-core merging
Analysis: MersenneTwister

[Chart: normalized execution time and DRAM row-buffer hit ratio for MersenneTwister, for each fetch policy (RR, ICOUNT, LRF, FAIR, ALL, BAR, MEM_BAR) under FRFCFS, FRFAIR, and REM_INST DRAM scheduling]

• Fair DRAM scheduling (FRFAIR, REM_INST) improves performance by increasing the DRAM row-buffer hit ratio

Baseline: RR + FRFCFS
Analysis: BlackScholes

[Chart: normalized execution time and MLP/100 for BlackScholes, under FRFCFS DRAM scheduling]

• Fairness-oriented fetch policies increase MLP
• Increased MLP + row-buffer hit ratio improves performance
Conclusion
• Compute intensive applications
– Fetch and DRAM scheduling do not matter
• Symmetric memory intensive applications
– Fairness-oriented fetch policies (FAIR, ALL, BAR, MEM_BAR) and DRAM policies (FR_FAIR, REM_INST) provide performance improvement
• MonteCarlo (40%), MersenneTwister (50%), BlackScholes (18%)
• Asymmetric memory intensive applications
– No correlation between performance and fetch and DRAM scheduling policies
THANK YOU!