Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc
In the proceedings of the 43rd Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), December 2010
Paper presentation by Sankalp Shivaprakash
Motivation
• Memory latency hiding through many-thread-aware prefetching schemes
  – Per-warp training and stride promotion
  – Inter-thread prefetching
  – Adaptive throttling
• Propose software and hardware prefetching mechanisms for a GPGPU architecture
  – Scalable to a large number of threads
  – Robust through feedback and throttling mechanisms that avoid performance degradation
Memory Latency Hiding Techniques
• Multithreading
  – Thread-level and warp-level context switching
• Utilization of complex cache memory hierarchies
  – Using L1, L2, and DRAM caching instead of accessing global memory each time
• Prefetching
  – Useful when thread-level parallelism alone is insufficient
• Memory request merging
(Figure: context-switch timeline — Thread1, Thread2, Thread1, Thread3)
Prefetching – Parallel Architectures
• Why prefetch: consider Warp1 and Warp2, each with three instructions (Add, Sub, Load)
• Without prefetching: once Load1 for Warp1 is outstanding, the core switches to Warp2 but eventually sits idle waiting for Load2
• With prefetching, the loads of later warps are fetched early, so the idle gap disappears:
  – Prefetch1 fetches the data for Load2
  – Prefetch2 fetches the data for Load3
(Figure: execution timelines for Warp1, Warp2, and Warp3 with and without prefetching)
Prefetching (Contd)
• Software prefetching
  – Prefetching into registers
  – Prefetching into cache
    • Can congest the cache if prefetches are uncontrolled or inaccurate
    • Can pollute the cache with useless data
Prefetching (Contd)
• Hardware prefetching
  – Stream prefetcher
    • Monitors the direction of accesses within a memory region
    • Once a constant access direction is detected, launches prefetches in that direction
  – Stride prefetcher
    • Tracks the difference (delta) between consecutive access addresses
    • Launches prefetch requests using the delta once a constant difference is detected (e.g., accesses at 1000, 2000, 3000 yield δ = 1000)
  – GHB (Global History Buffer) prefetcher
    • Stores miss addresses in an n-entry FIFO table (the GHB table)
    • Each miss address points to another entry, which allows detection of stream, stride, and irregular repeating address patterns
• How far ahead and how many prefetches are issued characterize a prefetcher's aggressiveness
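The stride prefetcher described above can be sketched in a few lines. This is a hedged illustration — the table layout and the two-match confirmation policy are simplifications for clarity, not the hardware from the paper:

```python
# Sketch of a PC-indexed stride prefetcher: after two consecutive accesses
# from the same PC show the same address delta, prefetch addr + delta.
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # pc -> (last_addr, last_delta, confirmed)

    def access(self, pc, addr):
        """Train on one access; return a prefetch address or None."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, None, False)
            return None
        last_addr, last_delta, confirmed = entry
        delta = addr - last_addr
        if delta == last_delta and delta != 0:
            self.table[pc] = (addr, delta, True)
            return addr + delta  # stride confirmed: prefetch ahead
        self.table[pc] = (addr, delta, False)
        return None

pf = StridePrefetcher()
for a in (1000, 2000, 3000):        # the δ = 1000 example from the slide
    hint = pf.access(pc=0x40, addr=a)
print(hint)  # 4000: delta 1000 seen twice, so prefetch 3000 + 1000
```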
Many-Thread Aware Prefetching – MT-SWP (software)
• Conventional stride prefetching
• Inter-thread prefetching (IP)
Many-Thread Aware Prefetching – MT-HWP (hardware)
Scalable versions of the traditional training policies for PC-based stride prefetchers:
• Per-warp training
  – Strong stride behavior exists within a warp
  – Stride information trained per warp is stored in a PWS (Per-Warp Stride) table
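Per-warp training can be sketched as a stride table indexed by (warp id, PC) instead of PC alone, so warps with different strides at the same load instruction do not corrupt each other's training. The structure and thresholds below are illustrative, not the paper's exact hardware:

```python
# Sketch of per-warp stride training: one table entry per (warp_id, PC).
class PerWarpStrideTable:
    def __init__(self):
        self.pws = {}  # (warp_id, pc) -> (last_addr, delta, match_count)

    def train(self, warp_id, pc, addr):
        key = (warp_id, pc)
        if key not in self.pws:
            self.pws[key] = (addr, None, 0)
            return None
        last_addr, delta, count = self.pws[key]
        new_delta = addr - last_addr
        count = count + 1 if new_delta == delta else 0
        self.pws[key] = (addr, new_delta, count)
        # issue a prefetch once the same stride has repeated
        return addr + new_delta if count >= 1 and new_delta else None

tbl = PerWarpStrideTable()
# warp 0 strides by 64 bytes, warp 1 by 128: neither pollutes the other
tbl.train(0, 0x80, 0);   tbl.train(1, 0x80, 4096)
tbl.train(0, 0x80, 64);  tbl.train(1, 0x80, 4224)
print(tbl.train(0, 0x80, 128), tbl.train(1, 0x80, 4352))  # 192 4480
```

With a single shared entry per PC, the interleaved deltas (64, 4032, 64, …) would never confirm; per-warp entries make both strides visible.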
Many-Thread Aware Prefetching – MT-HWP (contd)
• Stride promotion
  – Since the stride pattern is often the same across all warps for a given PC, each PWS entry is monitored for three accesses
  – If the stride is the same across warps, the entry is promoted to the Global Stride (GS) table; otherwise it is retained in the PWS table
• Inter-thread prefetching
  – Monitor the stride pattern across threads at the same PC for three memory accesses
  – If the stride is consistent, store the stride information in the IP table
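The inter-thread case can be sketched as follows. When consecutive threads at the same PC access addresses a constant delta apart, the prefetcher can issue requests on behalf of threads that have not reached the load yet. The three-access confirmation follows the slide text; the function shape and the per-thread normalization are assumptions for illustration:

```python
# Sketch of inter-thread (IP) stride training at one PC.
def train_inter_thread(accesses, confirm=3):
    """accesses: list of (thread_id, addr) at one PC, in arrival order.
    Returns the confirmed per-thread stride, or None."""
    deltas = []
    for (t0, a0), (t1, a1) in zip(accesses, accesses[1:]):
        if t1 != t0:  # normalize by thread distance
            deltas.append((a1 - a0) // (t1 - t0))
    if len(deltas) >= confirm and len(set(deltas[:confirm])) == 1:
        return deltas[0]
    return None

# threads 0..3 touch consecutive 4-byte words -> inter-thread stride of 4
stride = train_inter_thread([(0, 0), (1, 4), (2, 8), (3, 12)])
print(stride)  # 4
# with the stride confirmed, prefetch for threads 4..7: base + t * stride
print([t * stride for t in range(4, 8)])  # [16, 20, 24, 28]
```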
Many-Thread Aware Prefetching – MT-HWP Implementation
• When a PC hits in both the GS and IP tables, GS is given preference because:
  – Strides within a warp are more common than strides across warps
  – The GS entry has been trained over a longer period
Useful vs. Harmful Prefetching
• MTAML – Minimum Tolerable Average Memory Latency
  – The minimum average number of cycles per memory request that does not lead to stalls
• MTAML_pref – the analogous threshold when prefetching is enabled
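A hedged back-of-envelope for the intuition behind MTAML: under the simplifying assumption that each warp executes C compute cycles between memory requests and W warps interleave, a memory latency up to roughly W × C cycles can be hidden before the core stalls. The numbers below are illustrative, not from the paper:

```python
# Illustrative MTAML estimate: latency hidden by interleaving W warps
# that each do C compute cycles between memory requests.
def mtaml(num_warps, compute_cycles_per_mem):
    return num_warps * compute_cycles_per_mem

print(mtaml(num_warps=24, compute_cycles_per_mem=16))  # 384
# If the measured average memory latency exceeds this, requests stall the
# core -- the regime where prefetching can help, or hurt if its extra
# traffic pushes the average latency higher still.
```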
Useful vs. Harmful Prefetching (contd)
• Comparing MTAML with the measured average memory latency (AVG Latency) gives three regions:
  1. AVG Latency < MTAML and AVG Latency(PREF) < MTAML_pref: no stalls with or without prefetching
  2. AVG Latency > MTAML: prefetching is beneficial provided AVG Latency(PREF) is less than MTAML_pref
  3. Otherwise, prefetching might turn out to be useful or harmful
• The measured AVG Latency(PREF) ignores successfully prefetched memory operations
• Contention, and hence delay, grows as the number of warps increases
Useful vs. Harmful Prefetching (contd)
• Harmful prefetch requests can be caused by:
  – Queuing delays
  – DRAM row-buffer conflicts
  – Off-chip bandwidth wasted by early eviction
  – Off-chip bandwidth wasted by inaccurate prefetches
Metrics for Adaptive Prefetch Throttling
• Early eviction rate – fraction of prefetched blocks evicted before they are used
• Merge ratio – fraction of demand requests merged with in-flight prefetch requests
• Throttling on these metrics avoids:
  – Consuming system bandwidth
  – Delaying demand requests
  – Occupying the cache with unnecessary prefetches
• Prefetch requests might arrive late when demands merge with them, but context switching across warps compensates for the extra latency
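The throttling decision based on these two metrics can be sketched as a periodic check. The threshold values and function shape here are illustrative assumptions, not the paper's configuration:

```python
# Sketch of adaptive prefetch throttling: cut aggressiveness when the
# early-eviction rate is high (cache pollution) or the merge ratio is low
# (prefetches rarely useful to demand requests).
def throttle_decision(evicted_unused, prefetches_issued,
                      merged_demands, demand_requests,
                      eviction_threshold=0.5, merge_threshold=0.1):
    early_eviction_rate = evicted_unused / max(prefetches_issued, 1)
    merge_ratio = merged_demands / max(demand_requests, 1)
    if early_eviction_rate > eviction_threshold:
        return "throttle down"   # prefetches pollute the cache
    if merge_ratio < merge_threshold:
        return "throttle down"   # prefetches rarely help demands
    return "keep degree"

print(throttle_decision(80, 100, 5, 100))   # throttle down (80% evicted early)
print(throttle_decision(10, 100, 40, 100))  # keep degree
```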
Metrics for Adaptive Prefetch Throttling (contd)
• Early eviction rate and merge ratio are monitored at run time to adapt the prefetcher's aggressiveness
Methodology
• The baseline processor models NVIDIA's 8800GT
• Application traces for the simulator are generated using GPUOcelot, a binary-translation framework for PTX
Results and Discussion
(These slides present result figures only; no transcribed text.)
Conclusion
• The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than curbing it completely
• The metrics considered go beyond accuracy alone: they avoid cache pollution from early eviction and exploit memory-request merging
• Scalability and robustness were given importance
• The study does not consider complex cache memory hierarchies
• The overhead of prefetching is not clearly substantiated
Thank You