Many-Thread Aware Prefetching Mechanisms for GPGPU Applications
Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc
In the proceedings of the 43rd Annual IEEE/ACM International Symposium
on Microarchitecture (MICRO), December 2010
Paper presentation by Sankalp Shivaprakash
Motivation
• Memory latency hiding through multithreaded prefetching schemes
  – Per-warp training and stride promotion
  – Inter-thread prefetching
  – Adaptive throttling
• Proposes software and hardware prefetching mechanisms for a GPGPU architecture
  – Scalable to a large number of threads
  – Robust through feedback and throttling mechanisms that avoid degraded performance
Memory Latency Hiding Techniques
• Multithreading
  – Thread-level and warp-level context switching
• Utilization of complex cache memory hierarchies
  – Using L1, L2, and DRAM rather than accessing global memory each time
• Prefetching
  – Helps when thread-level parallelism is insufficient
• Memory request merging
Prefetching – Parallel Architectures
• Reason for prefetching: consider Warp1 and Warp2, each with three instructions (Add, Sub, Load)
• Without prefetch: after Load1 for Warp1, Warp2 sits idle waiting on Load2
• With prefetch:
  – Prefetch1: fetching for Load2
  – Prefetch2: fetching for Load3
[Timeline figures: execution of Warp1/Warp2/Warp3 with and without prefetching]
Prefetching (Contd)
• Software Prefetching
  – Prefetching into registers
  – Prefetching into cache
    • Congests the cache if prefetching is not controlled and accurate
    • Can pollute the cache with unused data
Prefetching (Contd)
• Hardware Prefetching
  – Stream Prefetcher
    • Monitors the direction of accesses in a memory region
    • Once a constant access direction is detected, launches prefetches in that direction
  – Stride Prefetcher
    • Tracks the difference (delta) between the addresses of consecutive accesses
    • Launches prefetch requests using the delta once a constant difference is detected
  – GHB (Global History Buffer) Prefetcher
    • Stores miss addresses in an n-entry FIFO table (the GHB)
    • Each miss address points to another entry, which allows detecting stream, stride, and irregular repeating address patterns
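The stride prefetcher's training loop can be sketched in a few lines of Python (an illustrative model, not the exact hardware tables; the class and field names, and the rule of gaining confidence after two matching deltas, are our own simplification):

```python
# Illustrative model of a PC-indexed stride prefetcher: for each load PC we
# remember the last address and last delta; once the same delta repeats, we
# issue a prefetch for the next expected address.
class StridePrefetcher:
    def __init__(self):
        self.table = {}  # PC -> (last_addr, last_delta, confident)

    def access(self, pc, addr):
        """Train on a demand access; return a prefetch address or None."""
        if pc not in self.table:
            self.table[pc] = (addr, None, False)
            return None
        last_addr, last_delta, _ = self.table[pc]
        delta = addr - last_addr
        confident = (delta == last_delta)  # constant difference detected
        self.table[pc] = (addr, delta, confident)
        return addr + delta if confident else None
```

For the slide's example sequence of addresses 0, 1000, 2000 at one PC, the third access trains δ = 1000 and triggers a prefetch for address 3000.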
[Stride example: successive accesses at addresses 0, 1000, 2000 give δ = 1000; *characterizes prefetcher aggressiveness]
Many-Thread Aware Prefetching
• Conventional stride prefetching
• Inter-thread prefetching (IP)
MT-SWP
Many-Thread Aware Prefetching
Scalable versions of the traditional training policies for PC-based stride prefetchers:
• Per-warp training
  – Strong stride behavior exists within a warp
  – Stride information trained per warp is stored in a PWS (Per-Warp Stride) table
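Per-warp training can be modeled by indexing the stride table by (PC, warp id), as the slide describes, so interleaved warps do not corrupt each other's deltas (the two-match confidence rule and names below are our simplification):

```python
# Sketch of a PWS (Per-Warp Stride) table: each warp trains its own stride
# independently, even when accesses from different warps interleave.
class PerWarpStrideTable:
    def __init__(self):
        self.pws = {}  # (pc, warp_id) -> (last_addr, last_delta, trained)

    def access(self, pc, warp_id, addr):
        """Train on one warp's access; return a prefetch address or None."""
        key = (pc, warp_id)
        if key not in self.pws:
            self.pws[key] = (addr, None, False)
            return None
        last_addr, last_delta, _ = self.pws[key]
        delta = addr - last_addr
        trained = (delta == last_delta)
        self.pws[key] = (addr, delta, trained)
        return addr + delta if trained else None
```

Even with Warp0 and Warp1 accesses interleaved (0, 4096, 1000, 5096, 2000, ...), each warp's entry cleanly trains δ = 1000, which a single globally-trained entry would miss.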
MT-HWP
Many-Thread Aware Prefetching
• Stride Promotion
  – Since the stride pattern may be the same across all warps for a given PC, the PWS entry is monitored for three accesses
  – If the same stride is found, the PWS entry is promoted to the Global Stride (GS) table; otherwise it is retained in the PWS
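The promotion step can be sketched as follows (the threshold of three matching observations follows the slide; the counter structure and names are our assumption):

```python
# Sketch of stride promotion: if warps keep training the same stride for a
# PC, the entry is promoted from the per-warp PWS table to a shared Global
# Stride (GS) table, so one entry serves all warps.
PROMOTE_AFTER = 3  # slide: same stride observed for three accesses

class StridePromoter:
    def __init__(self):
        self.pws = {}    # (pc, warp) -> stride trained per warp
        self.votes = {}  # pc -> (candidate_stride, consecutive match count)
        self.gs = {}     # pc -> promoted global stride

    def train(self, pc, warp, stride):
        self.pws[(pc, warp)] = stride
        cand, count = self.votes.get(pc, (stride, 0))
        if stride == cand:
            count += 1
        else:
            cand, count = stride, 1  # mismatch: restart with the new stride
        self.votes[pc] = (cand, count)
        if count >= PROMOTE_AFTER:
            self.gs[pc] = cand  # promote; entry retained in PWS otherwise
```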
• Inter-thread Prefetching
  – Monitors the stride pattern across threads at the same PC for three memory accesses
  – If the same stride is found, the stride information is stored in the IP table
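Inter-thread prefetching can be modeled similarly, training one delta across consecutive threads at the same PC (the training count of three follows the slide; the prefetch distance of four threads ahead is an arbitrary placeholder of ours):

```python
# Sketch of inter-thread prefetching (IP): train the delta between addresses
# issued by consecutive threads at the same PC; once stable, prefetch on
# behalf of threads that have not issued their loads yet.
class InterThreadPrefetcher:
    TRAIN = 3  # deltas that must match before the IP entry is trusted

    def __init__(self, distance=4):
        self.distance = distance  # how many threads ahead to prefetch
        self.ip = {}  # pc -> (last_addr, delta, consecutive match count)

    def access(self, pc, addr):
        """Train on one thread's access; return a prefetch address or None."""
        if pc not in self.ip:
            self.ip[pc] = (addr, None, 0)
            return None
        last_addr, delta, count = self.ip[pc]
        new_delta = addr - last_addr
        count = count + 1 if new_delta == delta else 1
        self.ip[pc] = (addr, new_delta, count)
        if count >= self.TRAIN:
            return addr + new_delta * self.distance
        return None
```

For threads touching addresses 0, 4, 8, 12, ... the fourth access confirms the per-thread delta of 4 and prefetches for the thread four positions ahead (address 28).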
MT-HWP
Many-Thread Aware Prefetching
• Implementation
• When there are hits in both the GS and IP tables, GS is given preference because:
  – Strides within a warp are more common than those across warps
  – GS entries are trained over a longer period
MT-HWP
Useful vs. Harmful Prefetching
• MTAML (Minimum Tolerable Average Memory Latency)
  – The minimum average number of cycles per memory request that does not lead to stalls
• MTAML_pref: the corresponding tolerable latency when prefetching is enabled
Useful vs. Harmful Prefetching
• Comparison of MTAML and the measured average latency (AVG Latency) gives three regions:
  1. AVG Latency < MTAML and AVG Latency(PREF) < MTAML_pref
  2. AVG Latency > MTAML: prefetching is beneficial provided AVG Latency(PREF) is less than MTAML_pref
  3. Otherwise, prefetching might turn out useful or harmful
• The measured AVG Latency(PREF) ignores successively prefetched memory operations
• Greater contention is seen as the number of warps increases, which increases delay
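The three regions can be expressed as a simple predicate (the helper name is ours, and reading region 1 as "no stalls either way" is our interpretation of the chart):

```python
def classify(avg_lat, avg_lat_pref, mtaml, mtaml_pref):
    """Classify a workload into the slide's three regions."""
    if avg_lat < mtaml and avg_lat_pref < mtaml_pref:
        return 1  # below both thresholds: no stalls with or without prefetching
    if avg_lat > mtaml and avg_lat_pref < mtaml_pref:
        return 2  # stalls without prefetching, none with it: beneficial
    return 3      # prefetching may turn out useful or harmful
```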
Useful vs. Harmful Prefetching
• Harmful prefetch requests can be due to:
  – Queuing delays
  – DRAM row-buffer conflicts
  – Wasted off-chip bandwidth due to early eviction
  – Wasted off-chip bandwidth due to inaccurate prefetches
Metrics for Adaptive Prefetch Throttling
• Early Eviction Rate
• Merge Ratio
Throttling avoids:
• Consumption of system bandwidth by useless prefetches
• Delaying of demand requests
• Occupation of the cache by unnecessary prefetches
Prefetch requests might arrive late when merged with demand requests, but this is compensated by context switching across warps
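A sketch of how the two metrics could drive the throttling decision (the thresholds, return encoding, and the "throttle up" branch are placeholders of ours; the paper's actual controller may differ):

```python
def adjust_aggressiveness(early_eviction_rate, merge_ratio,
                          evict_thr=0.5, merge_thr=0.5):
    """Return -1 (throttle down), 0 (keep), or +1 (throttle up)."""
    if early_eviction_rate > evict_thr:
        return -1  # prefetched lines evicted before use: cache pollution,
                   # so reduce prefetcher aggressiveness
    if merge_ratio > merge_thr:
        return 0   # prefetches are late but merged with demand requests;
                   # warp context switching compensates, so keep the level
    return 1       # accurate and timely prefetches: allow more aggressiveness
```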
Metrics for Adaptive Prefetch Throttling
• Monitoring of Early Eviction and Merge Ratio
Methodology
• Baseline processor: NVIDIA 8800GT
• Applications for the simulator are generated using GPUOcelot, a binary translation framework for PTX
Methodology
Results and Discussion
Conclusion
• The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than completely curbing it
• The metrics considered were convincing: they avoid cache pollution due to early eviction and exploit memory request merging, rather than relying on accuracy alone
• Scalability and robustness were given importance
• The study does not consider complex cache memory hierarchies
• The overhead of prefetching is not clearly substantiated
Thank You