Transcript of ARI. HiPEAC 2014
- 1. Viacheslav Fedorov, Sheng Qiu, Narasimha Reddy, Paul Gratz
Texas A&M University ARI: Adaptive Replacement and Insertion
HiPEAC 2013, Vienna, Austria
- 2. Conventional Main Memory Usually we only care about speeding
up the cache miss path
[Diagram: Core 0 to Core 3, L2$, L3$, Main Memory]
- 3. Main Memory: Trends New Memories emerging DRAM not dense
enough Replace or augment DRAM
[Diagram: Core 0 to Core 3, L2$, L3$; main memory options: DRAM, PCM, PCM with DRAM cache]
- 4. PCM Technology Based on Chalcogenide glass Exploits two
phases: Amorphous, Crystalline Higher density than DRAM Non-volatile
Image: Stanford NanoHeat Lab
- 5. DRAM vs PCM
  DRAM: writeback-agnostic; Write Buffers cushion the impact of writebacks; State-of-the-art policies target cache misses
  PCM: High write latency (Write Buffers insufficient); High write energy (Mobile, embedded devices?); Low cell endurance (Limited write cycles?)
  Parameter                    DRAM     PCM      PCM vs DRAM
  Row Read (power)             210 mW   78 mW    0.3x
  Row Write (power)            195 mW   773 mW   4x
  Activate (power)             75 mW    25 mW    0.3x
  Standby (power)              90 mW    45 mW    0.5x
  Refresh (power)              4 mW     0 mW     0x
  Initial Row Read (latency)   15 ns    28 ns    2x
  Row Write (latency)          22 ns    150 ns   7x
  Same Row R/W (latency)       15 ns    15 ns
- 6. Outline Introduction Motivation ARI: Adaptive Replacement
and Insertion Evaluation Summary Conclusion
- 7. Motivation PCM is attractive as a Main Memory, but... PCM
does not favor writes High energy High latency Low write cycle
tolerance Solution: reduce writes into Main Memory Modify LLC
policies to reduce Writebacks Mind the Miss rate!
- 8. Application behavior in High-Associativity Caches Bi-Polar
block distribution due to LRU policy 'Hot' blocks tend to group
towards MRU side 'Cold' blocks towards LRU side in a set Hot blocks
have higher Hit-ratio Cold blocks tend to have similar Hit-ratios
[Figure: Hit distribution in a high-associativity cache (16-way); % hit rate vs. position in LRU stack, MRU to LRU, with 'Hot' and 'Cold' regions marked]
- 9. Static LLC policies Based on the observed hot-cold
distribution 16-way cache: 16 static policies, xH16 (x High-hit
blocks) Replace any clean block among the (16-x) Low-hit blocks
Drawbacks: No single static policy good for all applications Fewer
writebacks => more cache misses when replacing hot blocks
- 10. Enter ARI: Adaptive Replacement and Insertion Goal: Reduce
LLC writebacks ! Keep miss rate lower than conventional policies
How? Do not replace dirty cache blocks (as long as possible) Place
fresh incoming blocks into LLC smartly Dynamically choose the best
policy
- 11. ARI: Operation Evict clean blocks from Low-Hit region
Insert new blocks into top of Low-Hit region
[Figure: % hit rate vs. position in LRU stack, MRU to LRU, with High-Hit and Low-Hit regions marked]
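The two rules on this slide (evict a clean block from the Low-Hit region, insert new blocks at the top of the Low-Hit region) can be sketched for a single cache set. This is a minimal illustration, not the authors' implementation. The names `Block` and `AriSet`, the LRU-style promotion to MRU on a hit, the choice of the clean block nearest the LRU end as victim, and the fallback to plain LRU eviction when the Low-Hit region holds no clean block are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    dirty: bool = False


class AriSet:
    """One cache set; blocks[0] is the MRU position, blocks[-1] the LRU position."""

    def __init__(self, ways=16, high_hit=4):
        # high_hit is the x in xH16: the first x stack positions form the
        # High-Hit region, the remaining (ways - x) form the Low-Hit region.
        assert 0 <= high_hit < ways
        self.ways = ways
        self.high_hit = high_hit
        self.blocks = []

    def access(self, tag, is_write=False):
        """Return (hit, writeback) for one access to this set."""
        for i, blk in enumerate(self.blocks):
            if blk.tag == tag:
                blk.dirty = blk.dirty or is_write
                # Assumed: a hit promotes the block to MRU, as in plain LRU.
                self.blocks.insert(0, self.blocks.pop(i))
                return True, False

        writeback = False
        if len(self.blocks) == self.ways:
            victim = self.blocks.pop(self._pick_victim())
            writeback = victim.dirty  # only dirty victims cost a writeback

        # Insert the new block at the top of the Low-Hit region.
        pos = min(self.high_hit, len(self.blocks))
        self.blocks.insert(pos, Block(tag, is_write))
        return False, writeback

    def _pick_victim(self):
        # Scan the Low-Hit region from the LRU end for a clean block.
        for i in range(self.ways - 1, self.high_hit - 1, -1):
            if not self.blocks[i].dirty:
                return i
        # Assumed fallback: no clean block available, evict the LRU block.
        return self.ways - 1
```

In a 4-way set with `high_hit=2`, if the LRU block is dirty and the block above it is clean, this sketch evicts the clean block and reports no writeback, where plain LRU would have written the dirty block back.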
- 12. ARI: Operation Application hit-distributions are not static
Dynamic policy adaptation based on epochs Emulate various static
thresholds in LLC tags Pick the best one for next epoch (25k LLC
accesses) Misses + Writebacks metric used
[Figure: % hit rate vs. position in LRU stack, MRU to LRU]
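The epoch-based adaptation described on this slide can be sketched as a small selector that accumulates Misses + Writebacks for each emulated static threshold and switches to the cheapest one when the epoch ends. This is a sketch under stated assumptions, not the authors' hardware design. The class name `EpochSelector`, the `record` interface, the starting threshold, and the tie-breaking rule (first candidate listed wins) are hypothetical. The 25k epoch length and the candidate thresholds 4, 10 and 14 come from the slides.

```python
EPOCH_LEN = 25_000  # LLC accesses per epoch, as stated on the slide


class EpochSelector:
    """Pick, once per epoch, the threshold with the lowest Misses + Writebacks."""

    def __init__(self, thresholds, epoch_len=EPOCH_LEN):
        self.thresholds = list(thresholds)
        self.epoch_len = epoch_len
        self.cost = {t: 0 for t in self.thresholds}
        self.accesses = 0
        # Assumed: start with the first candidate until an epoch completes.
        self.active = self.thresholds[0]

    def record(self, outcomes):
        """Account for one LLC access.

        outcomes maps each threshold to (missed, wrote_back), the result
        that threshold's emulated policy would have produced for this access.
        Returns the threshold to use for the next access.
        """
        for t, (missed, wrote_back) in outcomes.items():
            self.cost[t] += int(missed) + int(wrote_back)
        self.accesses += 1

        if self.accesses == self.epoch_len:
            # End of epoch: adopt the cheapest threshold and reset counters.
            self.active = min(self.thresholds, key=lambda t: self.cost[t])
            self.cost = {t: 0 for t in self.thresholds}
            self.accesses = 0
        return self.active


selector = EpochSelector([4, 10, 14])  # i.e. 4H16, 10H16, 14H16
```

Summing misses and writebacks into one counter keeps the selection logic to a single comparison per candidate; weighting the two event types differently would be a design variation not described on the slides.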
- 13. ARI: Implementation
Emulate static thresholds in shadow tags Adapt to the
hit-distribution dynamically
[Diagram: Core 0 to Core 3, L2$, L3$ with Tag Array, Data Array and Shadow Tag Array; emulated thresholds 4H16, 10H16, 14H16]
- 14. Outline Introduction Motivation ARI: Adaptive Replacement
and Insertion Evaluation Summary Conclusion
- 15. Methodology gem5 + DRAMSim2 simulators NVIDIA Tegra-like
out-of-order, dual-issue CPU SPEC2006 and PARSEC suites Compared
against state-of-the-art policies ARI beats them in writeback
reduction Nearly identical in total performance
  System        Single core                                   Multicore
  L1 cache      32KB I + 64KB D, 2-way, LRU, 64B block        32KB I + 64KB D, 2-way, LRU, 64B block
  L2 cache      256KB, 8-way, LRU, 64B block                  256KB, 8-way, LRU, 64B block (private)
  L3 cache      2MB, 16-way, LRU, 64B block                   16MB, 16-way, LRU, 64B block (shared)
  Main memory   4GB, DDR3-1333 DRAM, 32-entry write buffer    4GB, DDR3-1333 DRAM, 32-entry write buffer
- 16. ARI: Writeback reduction ARI beats the competition: 33% WB
reduction Writeback improvement, normalized to LRU policy DIP: M.
Qureshi et al, ISCA '09 DBLK: S. Khan et al, MICRO '10 RRIP: A.
Jaleel et al, ISCA '10
- 17. ARI: Miss reduction ARI achieves a 4.7% miss reduction Miss
rate improvement, normalized to LRU policy DIP: M. Qureshi et al,
ISCA '09 DBLK: S. Khan et al, MICRO '10 RRIP: A. Jaleel et al, ISCA
'10
- 18. ARI: Performance improvement ARI yields a 5% IPC
improvement on average IPC improvement, normalized to LRU
policy
- 19. ARI: Dynamic behavior ARI adapts to program phases Achieves
lower WBs than the best static policy
[Figures: Writebacks for the soplex application, SPEC 2006, and the mcf application, SPEC 2006]
- 20. ARI: Multicore applications
- 21. ARI: PCM lifetime improvement ARI facilitates the use of
PCM as Main Memory Decrease lifetime for several apps
[Figure: % PCM lifetime improvement for DIP, DBLK, RRIP and ARI; axis from 0% to 60%]
- 22. ARI: PCM lifetime improvement
- 23. ARI: Hardware overhead 8 sets shadowed per LLC bank (x8)
p*2 shadow tags (we use p=9) 14kB storage overhead in a 16MB LLC
Epoch counter 15 bits Performance counters, adders Not on critical
path Can be designed for low power
- 24. Outline Introduction Motivation ARI: Adaptive Replacement
and Insertion Evaluation Summary Conclusion
- 25. ARI: Summary 33% writeback reduction 4.7% cache miss rate
reduction 9% less Main Memory traffic System IPC boost of 5%
Enabling PCM as Main Memory 50% lifetime improvement Win Win
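As a rough consistency check on the summary numbers, one can assume that PCM lifetime is inversely proportional to the number of writes reaching main memory and that wear is spread evenly across cells. Neither assumption is stated on the slides. Under that model, a 33% writeback reduction gives a lifetime gain of about 49%, which is in line with the 50% quoted. The function name `lifetime_gain` is hypothetical.

```python
def lifetime_gain(write_reduction: float) -> float:
    """Relative lifetime gain if lifetime is proportional to 1 / writes.

    write_reduction is the fraction of writes removed (0.33 for 33%).
    Assumes evenly distributed wear, which the slides do not state.
    """
    if not 0.0 <= write_reduction < 1.0:
        raise ValueError("write_reduction must be in [0, 1)")
    return 1.0 / (1.0 - write_reduction) - 1.0


print(f"{lifetime_gain(0.33):.1%}")  # prints 49.3%
```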
- 26. Conclusion DRAM is hitting a scalability wall New
memories/architectures proposed We target PCM as main memory
Propose ARI: Adaptive Replacement and Insertion Simple scheme
Reduce writebacks to main memory Boost the PCM performance and
lifetime
- 27. Thank you! Questions?
- 28. Backup Slides
- 29. Related Work: PCM
  G. Dhiman et al. PDRAM: A hybrid PRAM and DRAM main memory system. DAC '09
  M. K. Qureshi et al. Enhancing Lifetime and Security of PCM-based Main Memory with Start-Gap Wear Leveling. MICRO '09
  B. C. Lee et al. Architecting Phase Change Memory as a Scalable DRAM Alternative. ISCA '09
  M. K. Qureshi et al. Scalable high performance main memory system using phase-change memory technology. ISCA '09
  A. P. Ferreira et al. Increasing PCM main memory lifetime. DATE '10
- 30. Related Work: PCM
  N. H. Seong et al. Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping. ISCA '10
  H. Yoon et al. Row buffer locality aware caching policies for hybrid memories. ICCD '12
  Stuecheli et al. The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies. ISCA '10
  M. K. Qureshi & G. H. Loh. Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design. MICRO '12
- 31. ARI: Insertion impact
- 32. ARI: Total Memory Traffic
[Figure: Total memory traffic, Misses + Writebacks, normalized to LRU, for 4H16 and ARI; axis from 0.6 to 1.4. Benchmarks: gcc, bzip, bwaves, mcf, milc, zeus, gromacs, cactusADM, leslie3d, namd, gobmk, soplex, hmmer, sjeng, GemsFDTD, h264ref, astar, sphinx3, avg]