GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal
Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
GPU Programming Gets Popular
• Many domains are using GPUs for high performance
(Figures: GPU-accelerated Molecular Dynamics; GPU-accelerated Seismic Imaging)
• Available in both high-end and low-end systems
  • the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012]
  • commodity desktops/laptops equipped with GPUs
Writing Efficient GPU Programs is Challenging

• Need careful management of
  • a large number of threads
(Figure: thread blocks)
Writing Efficient GPU Programs is Challenging

• Need careful management of
  • a large number of threads
  • multi-layer memory hierarchy
(Figure: Kepler GK110 memory hierarchy — per-thread-block Shared Memory, L1 Cache, and Read-only Data Cache; L2 Cache; DRAM (device memory))
Writing Efficient GPU Programs is Challenging

• Need careful management of
  • a large number of threads
  • multi-layer memory hierarchy
(Figure: Kepler GK110 memory hierarchy — the on-chip Shared Memory, L1 Cache, and Read-only Data Cache are fast but small; the L2 Cache and DRAM (device memory) are large but slow)
Writing Efficient GPU Programs is Challenging
Which data in shared memory are infrequently accessed?
Which data in device memory are frequently accessed?
(Figure: Kepler GK110 memory hierarchy — Shared Memory, L1 Cache, Read-only Data Cache, L2 Cache, DRAM (device memory))
Writing Efficient GPU Programs is Challenging

• Existing tools can't help much
  • inapplicable to GPUs
  • coarse-grained
  • prohibitive runtime overhead
  • cannot handle irregular/indirect accesses
Which data in shared memory are infrequently accessed?
Which data in device memory are frequently accessed?
(Figure: Kepler GK110 memory hierarchy — Shared Memory, L1 Cache, Read-only Data Cache, L2 Cache, DRAM (device memory))
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
GMProf-basic: The Naïve Profiling Approach
• Shared Memory Profiling
  • integer counters to count accesses to shared memory
  • one counter for each shared memory element
  • atomically update the counters, to avoid race conditions among threads
• Device Memory Profiling
  • integer counters to count accesses to device memory
  • one counter for each element in the user's device memory arrays, since device memory is too large to be monitored as a whole (e.g., 6GB)
  • atomically update the counters
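The bookkeeping of this naïve scheme can be sketched as follows. This is a host-side Python simulation, not the tool itself: in GMProf the counters live in GPU memory and each increment is an `atomicAdd` inside the instrumented CUDA kernel; the function name here is illustrative.

```python
# Simulation of GMProf-basic: one integer counter per monitored element.
# On the GPU the increment would be atomicAdd(&counters[idx], 1)
# to avoid races between concurrent threads.
def profile_accesses(array_len, access_trace):
    counters = [0] * array_len        # one counter per element
    for idx in access_trace:          # every access bumps its counter
        counters[idx] += 1
    return counters

# Element 0 is touched three times, elements 1 and 3 once each.
print(profile_accesses(4, [0, 0, 1, 3, 0]))   # [3, 1, 0, 1]
```

The per-element granularity is what makes the approach fine-grained, and also what makes it expensive without the optimizations that follow.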
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
GMProf-SA: Static Analysis Optimization
• Observation I: Many memory accesses can be determined statically
1. __shared__ int s[];
2. …
3. s[threadIdx.x] = 3;
GMProf-SA: Static Analysis Optimization
• Observation I: Many memory accesses can be determined statically
1. __shared__ int s[];
2. …
3. s[threadIdx.x] = 3;
Don’t need to count the access at runtime
GMProf-SA: Static Analysis Optimization
• Observation I: Many memory accesses can be determined statically
1. __shared__ int s[];
2. …
3. s[threadIdx.x] = 3;
Don’t need to count the access at runtime
• How about this …
1. __shared__ float s[];
2. …
3. for(r=0; …; …) {
4. for(c=0; …; …) {
5. temp = s[input[c]];
6. }
7. }
GMProf-SA: Static Analysis Optimization
• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] is irrelevant to the outer loop iterator r
1. __shared__ float s[];
2. …
3. for(r=0; …; …) {
4. for(c=0; …; …) {
5. temp = s[input[c]];
6. }
7. }
GMProf-SA: Static Analysis Optimization
• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] is irrelevant to the outer loop iterator r
1. __shared__ float s[];
2. …
3. for(r=0; …; …) {
4. for(c=0; …; …) {
5. temp = s[input[c]];
6. }
7. }
Don’t need to profile in every r iteration
GMProf-SA: Static Analysis Optimization
• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] is irrelevant to the outer loop iterator r
1. __shared__ float s[];
2. …
3. for(r=0; …; …) {
4. for(c=0; …; …) {
5. temp = s[input[c]];
6. }
7. }
Don’t need to profile in every r iteration
• Observation III: Some accesses are tid-invariant
  • E.g., s[input[c]] is irrelevant to threadIdx
GMProf-SA: Static Analysis Optimization
• Observation II: Some accesses are loop-invariant
  • E.g., s[input[c]] is irrelevant to the outer loop iterator r
1. __shared__ float s[];
2. …
3. for(r=0; …; …) {
4. for(c=0; …; …) {
5. temp = s[input[c]];
6. }
7. }
Don’t need to profile in every r iteration
• Observation III: Some accesses are tid-invariant
  • E.g., s[input[c]] is irrelevant to threadIdx

Don’t need to update the counter in every thread
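Observations II and III can be read together: since s[input[c]] depends on neither r nor threadIdx, the counter update can be hoisted out of the r loop and performed once. A Python sketch under illustrative names — one possible way to preserve the totals is to scale the hoisted update by the loop trip count:

```python
# Loop-invariant profiling sketch: the index input[c] does not depend
# on the outer iterator r, so updating the counter in every r iteration
# is redundant work.
def profile_naive(input_idx, R):
    counters = {}
    for r in range(R):                          # outer loop
        for c in range(len(input_idx)):         # inner loop
            i = input_idx[c]
            counters[i] = counters.get(i, 0) + 1
    return counters

def profile_hoisted(input_idx, R):
    counters = {}
    for c in range(len(input_idx)):             # profile one r iteration...
        i = input_idx[c]
        counters[i] = counters.get(i, 0) + R    # ...and scale by trip count
    return counters

assert profile_naive([2, 2, 5], 10) == profile_hoisted([2, 2, 5], 10)
```

The hoisted version performs one counter update per inner iteration instead of one per (r, c) pair, which is where the overhead reduction comes from.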
GMProf-NA: Non-Atomic Operation Optimization
• Atomic operations cost a lot
  • serialize all concurrent threads when updating a shared counter
• Use non-atomic operations to update counters
  • does not impact overall accuracy, thanks to the other optimizations
(Figure: with atomicAdd(&counter, 1), otherwise-concurrent threads are serialized on the shared counter)
GMProf-SM: Shared Memory Counters Optimization
• Make full use of shared memory
  • store counters in shared memory when possible
  • reduce counter size (e.g., 32-bit integer counters -> 8-bit)
(Figure: memory hierarchy — the on-chip shared memory, L1 cache, and read-only data cache are fast but small)
GMProf-SM: Shared Memory Counters Optimization
• Make full use of shared memory
  • store counters in shared memory when possible
  • reduce counter size (e.g., 32-bit integer counters -> 8-bit)
(Figure: memory hierarchy — the on-chip shared memory, L1 cache, and read-only data cache are fast but small)
GMProf-TH: Threshold Optimization

• Precise counts may not be necessary
  • E.g., A is accessed 10 times, while B is accessed > 100 times
• Stop counting once a counter reaches a certain threshold
  • tradeoff between accuracy and overhead
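The threshold idea amounts to saturating counters: a counter stops increasing once it hits the threshold, so hot data remains distinguishable from cold data without paying for exact counts. A minimal sketch, with an illustrative threshold value (not taken from the talk):

```python
# GMProf-TH sketch: saturating counters. THRESHOLD = 100 is illustrative.
THRESHOLD = 100

def saturating_add(counter):
    # Stop counting once the threshold is reached.
    return counter if counter >= THRESHOLD else counter + 1

cold, hot = 0, 0
for _ in range(10):          # 10 accesses: exact count preserved
    cold = saturating_add(cold)
for _ in range(250):         # 250 accesses: counter saturates at THRESHOLD
    hot = saturating_add(hot)

print(cold, hot)             # 10 100 -> the hot counter is reported as "THR"
```

A saturated counter is reported as "THR" (at least THRESHOLD accesses), which is exactly the behavior visible in the case-study tables later in the talk.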
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
GMProf-Enhanced: Live Range Analysis
• The number of accesses to a shared memory location may be misleading
(Figure: the same shm_buf location in shared memory successively holds data0, data1, and data2 as each is loaded from input_array in device memory and stored to output_array in device memory)

• Need to count the accesses/reuse of DATA, not addresses
• Track data during its live range in shared memory
• Use a logical clock to mark the boundaries of each live range
  • separate counters for each live range, based on the logical clock
GMProf-Enhanced: Live Range Analysis
1. ...
2. shm_buffer = input_array[0] // load data0 from DM to ShM
3. ...
4. output_array[0] = shm_buffer // store data0 from ShM to DM
5. ...
6. ...
7. shm_buffer = input_array[1] // load data1 from DM to ShM
8. ...
9. output_array[1] = shm_buffer // store data1 from ShM to DM
10. ...
(lines 2–4: live range of data0; lines 7–9: live range of data1)
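The logical-clock scheme above can be sketched as follows: every load into the shared memory buffer advances the clock and opens a new live range, and subsequent accesses are credited to the current live range rather than to the address. The event encoding is illustrative, not GMProf's actual representation.

```python
# Enhanced-algorithm sketch: count accesses per live range, not per address.
def count_per_live_range(events):
    clock = -1
    counts = []                      # one counter per live range
    for ev in events:
        if ev == "load":             # new data resident -> advance the clock
            clock += 1
            counts.append(0)
        else:                        # "access": credit the resident data
            counts[clock] += 1
    return counts

# The address shm_buf is touched 4 times in total, but each data item
# only twice — little real reuse, which per-address counting would hide.
print(count_per_live_range(["load", "access", "access",
                            "load", "access", "access"]))   # [2, 2]
```

This is why the enhanced algorithm reports small per-data counts (e.g., shm_buf (2)) in Case Study II even though the naïve counters for the same address are in the thousands.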
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
Methodology

• Platform
  • GPU: NVIDIA Tesla C1060
    • 240 cores (30×8), 1.296 GHz
    • 16KB shared memory per SM
    • 4GB device memory
  • CPU: 2× AMD Opteron, 2.6 GHz
  • 8GB main memory
  • Linux kernel 2.6.32
  • CUDA Toolkit 3.0
• Six Applications
  • Co-clustering, EM clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC
Runtime Overhead for Profiling Shared Memory Use
(Chart: shared memory profiling overheads per application — the labeled bars range from 648x down to 2.6x across the profiling configurations, with the naïve approach reaching 90x–648x)
Runtime Overhead for Profiling Device Memory Use
(Chart: device memory profiling overheads per application — the labeled bars range from 197x down to 1.6x across the profiling configurations)
Case Study I: Put the most frequently used data into shared memory
| Profiling Result | GMProf-basic | GMProf w/o TH | GMProf w/ TH |
|---|---|---|---|
| ShM | 0 | 0 | 0 |
| DM | A1(276) A2(276) A3(128) A4(1) | A1(276) A2(276) A3(128) A4(1) | A1(THR) A2(THR) A3(128) A4(1) |
• bo_v1: a naïve implementation where all data arrays are stored in device memory

A1 ~ A4: four data arrays; (N): average access count of the elements in the corresponding data array
• bo_v2: an improved version that puts the most frequently used arrays (identified by GMProf) into shared memory
Case Study I: Put the most frequently used data into shared memory
| Profiling Result | GMProf-basic | GMProf w/o TH | GMProf w/ TH |
|---|---|---|---|
| ShM | A1(174,788) A2(169,221) | A1(165,881) A2(160,315) | A1(THR) A2(THR) |
| DM | A3(128) A4(1) | A3(128) A4(1) | A3(128) A4(1) |
• bo_v2 outperforms bo_v1 by a factor of 39.63
• jcb_v1: the shared memory is accessed frequently, but there is little reuse of the data
Case Study II: identify the true reuse of data
| Profiling Result | GMProf-basic | GMProf w/o Enh. Alg. | GMProf w/ Enh. Alg. |
|---|---|---|---|
| ShM | shm_buf (5,760) | shm_buf (5,748) | shm_buf (2) |
| DM | in(4) out(1) | in(4) out(1) | in(4) out(1) |

• jcb_v2:

| Profiling Result | GMProf-basic | GMProf w/o Enh. Alg. | GMProf w/ Enh. Alg. |
|---|---|---|---|
| ShM | shm_buf (4,757) | shm_buf (4,741) | shm_buf (4) |
| DM | in(1) out(1) | in(1) out(1) | in(1) out(1) |

• jcb_v2 outperforms jcb_v1 by a factor of 2.59
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
Conclusions
• GMProf
  • statically-assisted dynamic profiling approach
  • architecture-based optimizations
  • live range analysis to capture the real usage of data
  • low-overhead & fine-grained
  • may be applied to profile other events
Thanks!