Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc
In the proceedings of the 43rd Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), December 2010
Paper presentation by Sankalp Shivaprakash
Motivation
• Memory latency hiding through multithreaded prefetching schemes
  – Per-warp training and stride promotion
  – Inter-thread prefetching
  – Adaptive throttling
• Proposes software and hardware prefetching mechanisms for a GPGPU architecture
  – Scalable to a large number of threads
  – Robust through feedback and throttling mechanisms that avoid degraded performance
Memory Latency Hiding Techniques
• Multithreading
  – Thread-level and warp-level context switching
• Utilization of complex cache memory hierarchies
  – Using L1, L2, and DRAM rather than accessing global memory each time
• Prefetching
  – Helps when thread-level parallelism is insufficient
• Memory request merging
Prefetching – Parallel Architectures
• Reason for prefetching: consider Warp1 and Warp2, each with three instructions (Add, Sub, Load)
• Without prefetch: after Load1 for Warp1, Warp2 sits idle waiting on Load2
• With prefetch:
  – Prefetch1: fetching for Load2
  – Prefetch2: fetching for Load3
[Timeline figures: execution of Warp1/Warp2/Warp3 with and without prefetching]
Prefetching (Contd)
• Software Prefetching
  – Prefetching into registers
  – Prefetching into cache
    • Congests the cache if prefetching is not controlled and accurate
    • Can pollute the cache with unused data
Prefetching (Contd)
• Hardware Prefetching
  – Stream Prefetcher
    • Monitors the direction of accesses in a memory region
    • Once a constant access direction is detected, launches prefetches in that direction
  – Stride Prefetcher
    • Tracks the difference (delta) between the addresses of consecutive accesses
    • Launches prefetch requests using the delta once a constant difference is detected
  – GHB (Global History Buffer) Prefetcher
    • Stores miss addresses in an n-entry FIFO table (the GHB)
    • Each miss address points to another entry, which allows detecting stream, stride, and irregular repeating address patterns
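The stride prefetcher's training loop can be sketched in a few lines of Python (an illustrative model, not the exact hardware tables; the class and field names, and the rule of gaining confidence after two matching deltas, are our own simplification):

```python
# Illustrative model of a PC-indexed stride prefetcher: for each load PC we
# remember the last address and last delta; once the same delta repeats, we
# issue a prefetch for the next expected address.
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # PC -> (last_addr, last_delta, confident)

    def access(self, pc, addr):
        """Train on a demand access; return a prefetch address or None."""
        if pc not in self.table:
            self.table[pc] = (addr, None, False)
            return None
        last_addr, last_delta, _ = self.table[pc]
        delta = addr - last_addr
        confident = (delta == last_delta)  # constant difference detected
        self.table[pc] = (addr, delta, confident)
        return addr + delta if confident else None
```

For the slide's example sequence of addresses 0, 1000, 2000 at one PC, the third access trains δ = 1000 and triggers a prefetch for address 3000.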
[Stride example: successive accesses at addresses 0, 1000, 2000 give δ = 1000; *characterizes prefetcher aggressiveness]
Many-Thread Aware Prefetching
• Conventional stride prefetching
• Inter-thread prefetching (IP)
MT-SWP
Many-Thread Aware Prefetching
Scalable versions of the traditional training policies for PC-based stride prefetchers:
• Per-warp training
  – Strong stride behavior exists within a warp
  – Stride information trained per warp is stored in a PWS (Per-Warp Stride) table
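Per-warp training can be modeled by indexing the stride table by (PC, warp id), as the slide describes, so interleaved warps do not corrupt each other's deltas (the two-match confidence rule and names below are our simplification):

```python
# Sketch of a PWS (Per-Warp Stride) table: each warp trains its own stride
# independently, even when accesses from different warps interleave.
class PerWarpStrideTable:
    def __init__(self):
        self.pws = {}  # (pc, warp_id) -> (last_addr, last_delta, trained)

    def access(self, pc, warp_id, addr):
        """Train on one warp's access; return a prefetch address or None."""
        key = (pc, warp_id)
        if key not in self.pws:
            self.pws[key] = (addr, None, False)
            return None
        last_addr, last_delta, _ = self.pws[key]
        delta = addr - last_addr
        trained = (delta == last_delta)
        self.pws[key] = (addr, delta, trained)
        return addr + delta if trained else None
```

Even with Warp0 and Warp1 accesses interleaved (0, 4096, 1000, 5096, 2000, ...), each warp's entry cleanly trains δ = 1000, which a single globally-trained entry would miss.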
MT-HWP
Many-Thread Aware Prefetching
• Stride Promotion
  – Since the stride pattern may be the same across all warps for a given PC, the PWS entry is monitored for three accesses
  – If the same stride is found, the PWS entry is promoted to the Global Stride (GS) table; otherwise it is retained in the PWS
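The promotion step can be sketched as follows (the threshold of three matching observations follows the slide; the counter structure and names are our assumption):

```python
# Sketch of stride promotion: if warps keep training the same stride for a
# PC, the entry is promoted from the per-warp PWS table to a shared Global
# Stride (GS) table, so one entry serves all warps.
PROMOTE_AFTER = 3  # slide: same stride observed for three accesses

class StridePromoter:
    def __init__(self):
        self.pws = {}    # (pc, warp) -> stride trained per warp
        self.votes = {}  # pc -> (candidate_stride, consecutive match count)
        self.gs = {}     # pc -> promoted global stride

    def train(self, pc, warp, stride):
        self.pws[(pc, warp)] = stride
        cand, count = self.votes.get(pc, (stride, 0))
        if stride == cand:
            count += 1
        else:
            cand, count = stride, 1  # mismatch: restart with the new stride
        self.votes[pc] = (cand, count)
        if count >= PROMOTE_AFTER:
            self.gs[pc] = cand  # promote; entry retained in PWS otherwise
```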
• Inter-thread Prefetching
  – Monitors the stride pattern across threads at the same PC for three memory accesses
  – If the same stride is found, the stride information is stored in the IP table
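Inter-thread prefetching can be modeled similarly, training one delta across consecutive threads at the same PC (the training count of three follows the slide; the prefetch distance of four threads ahead is an arbitrary placeholder of ours):

```python
# Sketch of inter-thread prefetching (IP): train the delta between addresses
# issued by consecutive threads at the same PC; once stable, prefetch on
# behalf of threads that have not issued their loads yet.
class InterThreadPrefetcher:
    TRAIN = 3  # deltas that must match before the IP entry is trusted

    def __init__(self, distance=4):
        self.distance = distance  # how many threads ahead to prefetch
        self.ip = {}  # pc -> (last_addr, delta, consecutive match count)

    def access(self, pc, addr):
        """Train on one thread's access; return a prefetch address or None."""
        if pc not in self.ip:
            self.ip[pc] = (addr, None, 0)
            return None
        last_addr, delta, count = self.ip[pc]
        new_delta = addr - last_addr
        count = count + 1 if new_delta == delta else 1
        self.ip[pc] = (addr, new_delta, count)
        if count >= self.TRAIN:
            return addr + new_delta * self.distance
        return None
```

For threads touching addresses 0, 4, 8, 12, ... the fourth access confirms the per-thread delta of 4 and prefetches for the thread four positions ahead (address 28).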
MT-HWP
Many-Thread Aware Prefetching
• Implementation
• When there are hits in both the GS and IP tables, GS is given preference because:
  – Strides within a warp are more common than those across warps
  – GS entries are trained over a longer period
MT-HWP
Useful vs. Harmful Prefetching
• MTAML (Minimum Tolerable Average Memory Latency)
  – The minimum average number of cycles per memory request that does not lead to stalls
• MTAML_pref: the corresponding tolerable latency when prefetching is enabled
Useful vs. Harmful Prefetching
• Comparison of MTAML and the measured average latency (AVG Latency) gives three regions:
  1. AVG Latency < MTAML and AVG Latency(PREF) < MTAML_pref
  2. AVG Latency > MTAML: prefetching is beneficial provided AVG Latency(PREF) is less than MTAML_pref
  3. Otherwise, prefetching might turn out useful or harmful
• The measured AVG Latency(PREF) ignores successively prefetched memory operations
• Greater contention is seen as the number of warps increases, which increases delay
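The three regions can be expressed as a simple predicate (the helper name is ours, and reading region 1 as "no stalls either way" is our interpretation of the chart):

```python
def classify(avg_lat, avg_lat_pref, mtaml, mtaml_pref):
    """Classify a workload into the slide's three regions."""
    if avg_lat < mtaml and avg_lat_pref < mtaml_pref:
        return 1  # below both thresholds: no stalls with or without prefetching
    if avg_lat > mtaml and avg_lat_pref < mtaml_pref:
        return 2  # stalls without prefetching, none with it: beneficial
    return 3      # prefetching may turn out useful or harmful
```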
Useful vs. Harmful Prefetching
• Harmful prefetch requests can be due to:
  – Queuing delays
  – DRAM row-buffer conflicts
  – Wasted off-chip bandwidth due to early eviction
  – Wasted off-chip bandwidth due to inaccurate prefetches
Metrics for Adaptive Prefetch Throttling
• Early Eviction Rate
• Merge Ratio
Throttling avoids:
• Consumption of system bandwidth by useless prefetches
• Delaying of demand requests
• Occupation of the cache by unnecessary prefetches
Prefetch requests might arrive late when merged with demand requests, but this is compensated by context switching across warps
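A sketch of how the two metrics could drive the throttling decision (the thresholds, return encoding, and the "throttle up" branch are placeholders of ours; the paper's actual controller may differ):

```python
def adjust_aggressiveness(early_eviction_rate, merge_ratio,
                          evict_thr=0.5, merge_thr=0.5):
    """Return -1 (throttle down), 0 (keep), or +1 (throttle up)."""
    if early_eviction_rate > evict_thr:
        return -1  # prefetched lines evicted before use: cache pollution,
                   # so reduce prefetcher aggressiveness
    if merge_ratio > merge_thr:
        return 0   # prefetches are late but merged with demand requests;
                   # warp context switching compensates, so keep the level
    return 1       # accurate and timely prefetches: allow more aggressiveness
```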
Metrics for Adaptive Prefetch Throttling
• Monitoring of Early Eviction and Merge Ratio
Methodology
• Baseline processor: NVIDIA 8800GT
• Applications for the simulator are generated using GPUOcelot, a binary translation framework for PTX
Methodology
Results and Discussion
Conclusion
• The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than completely curbing it
• The metrics considered were convincing: they avoid cache pollution due to early eviction and exploit memory request merging, rather than relying on accuracy alone
• Scalability and robustness were given importance
• The study does not consider complex cache memory hierarchies
• The overhead of prefetching is not clearly substantiated
Thank You