Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun,...
-
Upload
andre-hilger -
Category
Documents
-
view
221 -
download
1
Transcript of Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun,...
![Page 1: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/1.jpg)
Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems
Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran ChenDepartment of Electrical and Computer Engineering
University of Pittsburgh
![Page 2: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/2.jpg)
• Data prefetching is important
Introduction
Not anySimple Software-aided
ComplicatedHeterogeneousSoftware/hardware
Prefetch data from memory to on-chip cache
![Page 3: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/3.jpg)
• Data prefetching is important
• But far from efficient– Average accuracy seldom
exceeds 60% across most designs– 40% prefetching is useless– Wasted memory bandwidth– Access conflict and cache pollution
• Sensitive to write latency
Introduction
![Page 4: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/4.jpg)
Introduction
• A example—Stream prefetcher in IBM POWER4 microprocessor
Up to 4 prefetchings
prefetcherOne LLC read
……
Keep track of 32 prefetching streams
Each stream can cover the memory address of 64 contiguous cache lines
Prefetch table
A heavy burden for LLC and memory subsystem!
![Page 5: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/5.jpg)
Introduction
• Spin-Transfer Torque Random Access Memory (STT-RAM) as Last-Level Cache (LLC)– High write cost of STT-RAM, ~ 10ns– Combine STT-RAM based LLC with data prefetching– High-density/small-size STT-RAM cell ->bank access conflicts– Low-density/large-size STT-RAM cells->prefetching-incurred cache
pollution
![Page 6: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/6.jpg)
Introduction
• Our work– Reduce the negative impact on system performance of
CMPs with STT-RAM based LLC induced by data prefetching
– Request Prioritization– Hybrid local-global prefetch control
![Page 7: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/7.jpg)
Outline
• Motivation• STT-RAM Basics• Methodology– Request prioritization– Hybrid local-global prefetch control
• Result• Conclusion
![Page 8: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/8.jpg)
STT-RAM Basics
• STT-RAM Cell:– Transistor and MTJ (Magnetic Tunnel Junction);– Denoted as 1T-1J cell.
• MTJ:– Free Layer and Ref. Layer;– Read: Direction → Resistance;– Write: Current → Direction.
Free Layer
Barrier
Reference Layer
Write-1
Write-0
Parallel (RLow), 0
Anti-Parallel (RHigh), 1
![Page 9: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/9.jpg)
STT-RAM Basics
• STT-RAM Cell:– Transistor and MTJ (Magnetic Tunnel Junction);– Denoted as 1T-1J cell.
• MTJ:– Free Layer and Ref. Layer;– Read: Direction → Resistance;– Write: Current → Direction.
• Write latency VS. write currentFree Layer
Barrier
Reference Layer
Write-1
Write-0
1E+00 1E+04 1E+08 1E+120
50100150200250300350400
Chart Title
Switching Time (ns)
Sw
itchi
ng C
urre
nt (
μA
)
![Page 10: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/10.jpg)
Outline
• Motivation• STT-RAM Basics• Methodology– Request Prioritization– Hybrid local-global prefetch control
• Result• Conclusion
![Page 11: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/11.jpg)
bank
CPU0 CPUn
bank
bank bank
memory
Request Prioritization(RP)
• Various LLC access requests– Access conflict– Further aggravated due to long write latency
• LLC access conflict--the source of performance degradation
read
fill prefetch fill
Write back
……
missLLC
conflict
1. Read 2. Fill 3. Write back 4. Prefetch fill
![Page 12: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/12.jpg)
bank
CPU0 CPUn
bank
bank bank
memory
Request Prioritization(RP)
• Various LLC access requests– Access conflict– Further aggravated due to long write latency
• LLC access conflict--the source of performance degradation– Critical request is blocked by non-critical
Request
read
……
LLC
prefetch fill
![Page 13: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/13.jpg)
bank
CPU0 CPUn
bank
bank bank
memory
Request Prioritization(RP)
• Various LLC access requests– Access conflict– Further aggravated due to long write latency
• LLC access conflict--the source of performance degradation– Critical request is blocked by non-critical
Request– Read is the most critical– Fill is slightly less critical than read– Prefetch fill is less critical than read/fill– Write back is the least critical
read
……
LLC
prefetch fill
stall…
fill
miss
Write back
![Page 14: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/14.jpg)
Request Prioritization(RP)
• Assign priority to individual request based on its criticality– Respond to the request based on its priority
• High-priority request is blocked by lowpriority-request
bank
CPU0 CPUn
bank
bank bank
memory
read
……
LLC
time
Prefetch fill
read
prefetch fill
![Page 15: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/15.jpg)
Request Prioritization(RP)
• Assign priority to individual request based on its criticality– Respond to the request based on its priority
• High-priority request is blocked by lowpriority-request– Retirement accomplishment degree (RAD)– Determines at what stage a low-priority request can be preempted by high-priority request
preempted< RAD
time
Prefetch fill
read
bank
CPU0 CPUn
bank
bank bank
memory
read
……
LLC
prefetch fill
![Page 16: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/16.jpg)
Outline
• Motivation• STT-RAM Basics• Methodology– Request prioritization– Hybrid local-global prefetch control
• Result• Conclusion
![Page 17: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/17.jpg)
Hybrid Local-global Prefetch Control (HLGPC)
• The prefetching efficiency is affected by the capacity (cell size) of STT-RAM based LLC– A large cell size alleviates bank conflict, by reducing the blocking
time of write operations– Cache pollution incurred by prefetching also becomes severer
due to the reduced total capacity
• Prefetch control considering LLC access contention– Dynamically control the aggressiveness of the prefetchers– Tune the prefetch distance/prefetch degree
![Page 18: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/18.jpg)
Hybrid Local-global Prefetch Control (HLGPC)
• Local (per core) prefetch control (LPC)– Focuses on maximizing the performance of each CPU core
• Hybrid Local-global Prefetch Control (HLGPC)– Achieve a balanced dynamic aggressiveness control– At core-level, feedback directed prefetching (FDP)* as the LPC– At chip-level, GPC may retain or override the decision of LPC based on
the runtime information of LLC
CaseCore i’s runtime information LLC’s info
DicisionPref. Accuracy Pref. Frequency LLC acc. Freq.
1 Low High High Force scale down2 High High High Disable scale up3 Low High Low Disable scale up4-8 - - - Allow local
decision
![Page 19: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/19.jpg)
Hybrid Local-global Prefetch Control (HLGPC)
• Local (per core) prefetch control (LPC)– Focuses on maximizing the performance of each CPU core
• Hybrid Local-global Prefetch Control (HLGPC)– Achieve a balanced dynamic aggressiveness control– At core-level, feedback directed prefetching (FDP)* as the LPC– At chip-level, GPC may retain or override the decision of LPC based on
the runtime information of LLC
• Sampling runtime information– Dynamically identify the execution phase of an app
• UpateCounti = 0.5 UpdateCounti-1 + 0.5 CurrentCounti
execution ii - 1
![Page 20: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/20.jpg)
Outline
• Motivation• STT-RAM Basics• Methodology– Request prioritization– Hybrid local-global prefetch control
• Result• Conclusion
![Page 21: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/21.jpg)
Experiment Methodology
• Workload construction– 2 SPEC CPU 2000 & 16 SPEC CPU 2006– 6 4-app workloads: at least 2 memory intensive, 1 prefetch intensive, 1
memory non-intensive
CPU1 CPU2 CPU3CPU0
STT-RAMBased LLC
Ring NOC
200 cycles to memory
32KB D/I L1
256 KB L2, 4 banks
IBM POWER4 stream prefetcher
Out-Of-Order, 4 issues
2/4/8 MB, 8 banks, 128 MSHRs
Read: 12/16/12 cycles
Write: 20/48/128 cycles
![Page 22: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/22.jpg)
• Three priority assignments– From P2 to P4, the aggressiveness
of read request goes up
• Performance improvement– Prioritizing LLC access requests always achieve system performance
improvement– Achieves more substantial performance improvement for large LLC with
long write access latency– The highest performance
is achieved by P1
Enhancement by RP
priority
Read & fill
Pref. fill
wb
P1
60%60%
Read
fill
Pref. fill
P2
60%wb60%
60%
25%
Read
fill
Pref. fill
P3
60%wb60%
100%
![Page 23: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/23.jpg)
Effectiveness of HLGPC
• Performance improvement– LPC alone achieves little im-
provement– HLGPC alone is better– HLGPC+P1 achieves the
highest improvement
• The total energy consumption is further reduced– Further decrease of dynamic energy– Much shorter execution time leads tocontinuously reduction of leakage energy– EDP is improved by 7.8%
1st: P1; 2nd: LPC+P1; 3rd: HLGPC+P1
![Page 24: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/24.jpg)
Outline
• Motivation• STT-RAM Basics• Methodology– Request prioritization– Hybrid local-global prefetch control
• Result• Conclusion
![Page 25: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/25.jpg)
Conclusion• In CMP systems with aggressive prefetching, STT-RAM based LLC
suffers from increased LLC cache latency due to higher write pressure and cache pollution
• Request prioritization can significantly mitigate the negative impact induced by bank conflicts on large LLC
• Coupling GPC and LPC can alleviate the cache pollution on small LLC• RP+HLGPC unveils the performance potential of prefetching in CMP
systems• System performance can be improved by 9.1%, 6.5%, and 11.0% for
2MB, 4MB, and 8MB STT-RAM LLCs; the corresponding LLC energy consumption is also saved by 7.3%, 4.8%, and 5.6%, respectively.
![Page 26: Prefetching Techniques for STT-RAM based Last-level Cache in CMP Systems Mengjie Mao, Guangyu Sun, Yong Li, Kai Bu, Alex K. Jones, Yiran Chen Department.](https://reader035.fdocuments.us/reader035/viewer/2022062515/56649c775503460f9492b8ea/html5/thumbnails/26.jpg)
Q&A
• Thank you