ARI. HiPEAK 2014

Transcript of ARI. HiPEAK 2014

  1. 1. ARI: Adaptive Replacement and Insertion. Viacheslav Fedorov, Sheng Qiu, Narasimha Reddy, Paul Gratz (Texas A&M University). HiPEAC 2013, Vienna, Austria
  2. 2. Conventional Main Memory. Usually we only care about speeding up the cache miss path. [Diagram labels: Core 0 to Core 3, L2$, L3$, Main Memory]
  3. 3. Main Memory: Trends. New memories are emerging; DRAM is not dense enough; replace or augment DRAM. [Diagram labels: Core 0 to Core 3, L2$, L3$, DRAM, PCM, DRAM cache]
  4. 4. PCM Technology. Based on chalcogenide glass; exploits two phases (amorphous, crystalline); higher density than DRAM; non-volatile. Image: Stanford NanoHeat Lab
  5. 5. DRAM vs PCM. DRAM is writeback-agnostic: write buffers cushion the impact of writebacks, and state-of-the-art policies target cache misses. PCM: high write latency (write buffers insufficient); high write energy (mobile, embedded devices?); low cell endurance (limited write cycles?).
     Parameter               DRAM      PCM       PCM vs DRAM
     Row Read (power)        210 mW    78 mW     0.3x
     Row Write (power)       195 mW    773 mW    4x
     Activate                75 mW     25 mW     0.3x
     Standby                 90 mW     45 mW     0.5x
     Refresh                 4 mW      0 mW      0x
     Initial Row Read        15 ns     28 ns     2x
     Row Write (latency)     22 ns     150 ns    7x
     Same Row R/W            15 ns     15 ns     (none given)
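The multipliers annotated on this slide (0.3x, 4x, 7x and so on) can be checked against the table values. The sketch below simply divides each PCM figure by its DRAM counterpart; the dictionary keys and the function name are illustrative choices, not taken from the slides. The slide's annotations look loosely rounded (78/210 is about 0.37, shown as 0.3x).

```python
# (DRAM, PCM) pairs copied from the slide; power in mW, latency in ns.
# Key names are hypothetical labels chosen for this sketch.
TABLE = {
    "row_read_power_mw": (210, 78),
    "row_write_power_mw": (195, 773),
    "activate_power_mw": (75, 25),
    "standby_power_mw": (90, 45),
    "refresh_power_mw": (4, 0),
    "initial_row_read_ns": (15, 28),
    "row_write_ns": (22, 150),
    "same_row_rw_ns": (15, 15),
}


def pcm_over_dram(table):
    """Return the PCM/DRAM ratio for every parameter in the table."""
    return {name: pcm / dram for name, (dram, pcm) in table.items()}


if __name__ == "__main__":
    for name, ratio in pcm_over_dram(TABLE).items():
        print(f"{name:22s} {ratio:.2f}x")
```

The two write rows dominate: a row write costs roughly four times the power and roughly seven times the latency of DRAM, which is the motivation for cutting writebacks.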
  6. 6. Outline: Introduction; Motivation; ARI: Adaptive Replacement and Insertion; Evaluation; Summary; Conclusion
  7. 7. Motivation. PCM is attractive as a main memory, but PCM does not favor writes: high energy, high latency, low write cycle tolerance. Solution: reduce writes into main memory by modifying LLC policies to reduce writebacks. Mind the miss rate!
  8. 8. Application behavior in High-Associativity Caches. Bi-polar block distribution due to LRU policy: 'hot' blocks tend to group towards the MRU side of a set, 'cold' blocks towards the LRU side. Hot blocks have a higher hit ratio; cold blocks tend to have similar hit ratios. [Figure: Hit distribution in a high-associativity cache (16-way); axes: % hit rate vs. position in LRU stack, MRU to LRU; regions: 'Hot', 'Cold']
  9. 9. Static LLC policies. Based on the observed hot-cold distribution. 16-way cache: 16 static policies, xH16; replace any clean block among the (16-x) low-hit blocks. Drawbacks: no single static policy is good for all applications; fewer writebacks => more cache misses when replacing hot blocks.
  10. 10. Enter ARI: Adaptive Replacement and Insertion. Goal: reduce LLC writebacks while keeping the miss rate lower than conventional policies. How? Do not replace dirty cache blocks (as long as possible); place fresh incoming blocks into the LLC smartly; dynamically choose the best policy.
  11. 11. ARI: Operation. Evict clean blocks from the Low-Hit region; insert new blocks into the top of the Low-Hit region. [Figure: % hit rate vs. position in LRU stack, MRU to LRU; regions: High-Hit, Low-Hit]
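The eviction and insertion rules on this slide can be sketched for a single cache set. This is a minimal model under stated assumptions, not the authors' implementation: the names `Block`, `AriSet` and `high_hit` are hypothetical, hits promote a block to MRU as in plain LRU, the clean victim is taken from the position closest to LRU, and when every Low-Hit block is dirty the set falls back to evicting the LRU block. The slides do not specify the last three details.

```python
from dataclasses import dataclass


@dataclass
class Block:
    tag: int
    dirty: bool = False


class AriSet:
    """One cache set; stack[0] is MRU, stack[-1] is LRU (illustrative model)."""

    def __init__(self, ways=16, high_hit=10):
        self.ways = ways
        self.high_hit = high_hit  # the x in xH16: size of the High-Hit region
        self.stack = []
        self.misses = 0
        self.writebacks = 0

    def access(self, tag, is_write=False):
        """Return True on a hit, False on a miss."""
        for pos, blk in enumerate(self.stack):
            if blk.tag == tag:
                blk.dirty = blk.dirty or is_write
                # Assumption: a hit promotes the block to MRU, as in LRU.
                self.stack.insert(0, self.stack.pop(pos))
                return True
        self.misses += 1
        if len(self.stack) >= self.ways:
            self._evict()
        # Insert at the top of the Low-Hit region (stack position x).
        self.stack.insert(min(self.high_hit, len(self.stack)), Block(tag, is_write))
        return False

    def _evict(self):
        # Prefer a clean block in the Low-Hit region, scanning from the LRU end.
        for pos in range(len(self.stack) - 1, self.high_hit - 1, -1):
            if not self.stack[pos].dirty:
                del self.stack[pos]
                return
        # Assumed fallback: no clean candidate, so evict the LRU block.
        victim = self.stack.pop()
        if victim.dirty:
            self.writebacks += 1
```

In a 4-way set with `high_hit=2`, filling the set with one dirty block in the Low-Hit region and then missing twice evicts two clean blocks and leaves the dirty one in place, whereas plain LRU would have written it back.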
  12. 12. ARI: Operation. Application hit distributions are not static, so the policy adapts dynamically based on epochs: emulate various static thresholds in LLC tags and pick the best one for the next epoch (25k LLC accesses). The Misses + Writebacks metric is used. [Figure: % hit rate vs. position in LRU stack, MRU to LRU]
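The epoch mechanism can be sketched as a small selector that accumulates a Misses + Writebacks cost for each emulated threshold and switches to the cheapest one when the epoch ends. The class name `EpochSelector`, the `record` method, the candidate set (4, 10, 14), the starting threshold, and the tie-breaking rule (first candidate wins) are all assumptions for illustration; only the 25k-access epoch and the cost metric come from the slide.

```python
class EpochSelector:
    """Pick the static threshold with the lowest misses + writebacks per epoch."""

    def __init__(self, candidates=(4, 10, 14), epoch_length=25_000, initial=10):
        self.candidates = tuple(candidates)
        self.epoch_length = epoch_length
        self.current = initial  # threshold applied to the real cache
        self.accesses = 0
        self.cost = {x: 0 for x in self.candidates}

    def record(self, shadow_events):
        """Account for one LLC access.

        shadow_events maps each candidate threshold to the number of misses
        plus writebacks that this access caused in that threshold's emulation.
        Returns the threshold to use for the next access.
        """
        for x, events in shadow_events.items():
            self.cost[x] += events
        self.accesses += 1
        if self.accesses == self.epoch_length:
            # Epoch boundary: adopt the cheapest candidate and reset counters.
            self.current = min(self.candidates, key=self.cost.__getitem__)
            self.accesses = 0
            self.cost = {x: 0 for x in self.candidates}
        return self.current
```

Summing misses and writebacks into one counter weights them equally, which matches the metric named on the slide; a design that cared more about PCM wear than about latency could weight writebacks more heavily.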
  13. 13. ARI: Implementation. Emulate static thresholds in shadow tags; adapt to the hit distribution dynamically. [Diagram labels: Core 0 to Core 3, L2$, L3$ with Tag Array, Data Array, and Shadow Tag Array emulating 4H16, 10H16, 14H16]
  14. 14. Outline: Introduction; Motivation; ARI: Adaptive Replacement and Insertion; Evaluation; Summary; Conclusion
  15. 15. Methodology. gem5 + DRAMSim2 simulators; NVIDIA Tegra-like out-of-order, dual-issue CPU; SPEC2006 and PARSEC suites. Compared against state-of-the-art policies: ARI beats them in writeback reduction and is nearly identical in total performance.
     System        Single core                                   Multicore
     L1 cache      32KB I + 64KB D, 2-way, LRU, 64B block        32KB I + 64KB D, 2-way, LRU, 64B block
     L2 cache      256KB, 8-way, LRU, 64B block                  256KB, 8-way, LRU, 64B block (private)
     L3 cache      2MB, 16-way, LRU, 64B block                   16MB, 16-way, LRU, 64B block (shared)
     Main memory   4GB, DDR3-1333 DRAM, 32-entry write buffer    4GB, DDR3-1333 DRAM, 32-entry write buffer
  16. 16. ARI: Writeback reduction. ARI beats the competition: 33% WB reduction. [Figure: Writeback improvement, normalized to LRU policy] Compared policies: DIP (M. Qureshi et al., ISCA '09), DBLK (S. Khan et al., MICRO '10), RRIP (A. Jaleel et al., ISCA '10)
  17. 17. ARI: Miss reduction. ARI achieves a 4.7% miss reduction. [Figure: Miss rate improvement, normalized to LRU policy] Compared policies: DIP (M. Qureshi et al., ISCA '09), DBLK (S. Khan et al., MICRO '10), RRIP (A. Jaleel et al., ISCA '10)
  18. 18. ARI: Performance improvement. ARI yields a 5% IPC improvement on average. [Figure: IPC improvement, normalized to LRU policy]
  19. 19. ARI: Dynamic behavior. ARI adapts to program phases and achieves lower WBs than the best static policy. [Figure panels: soplex application, SPEC 2006; mcf application, SPEC 2006; axis label: Writebacks]
  20. 20. ARI: Multicore applications
  21. 21. ARI: PCM lifetime improvement. ARI facilitates the use of PCM as main memory. [Figure: % PCM lifetime improvement (axis 0% to 60%) for DIP, DBLK, RRIP, ARI; annotation: "Decrease lifetime for several apps"]
  22. 22. ARI: PCM lifetime improvement
  23. 23. ARI: Hardware overhead. 8 sets shadowed per LLC bank (x8); p*2 shadow tags (we use p=9); 14kB storage overhead in a 16MB LLC; epoch counter: 15 bits; performance counters, adders; not on the critical path; can be designed for low power.
  24. 24. Outline: Introduction; Motivation; ARI: Adaptive Replacement and Insertion; Evaluation; Summary; Conclusion
  25. 25. ARI: Summary. 33% writeback reduction; 4.7% cache miss rate reduction; 9% less main memory traffic; system IPC boost of 5%. Enabling PCM as main memory: 50% lifetime improvement. Win-win.
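The 33% writeback reduction and the 50% lifetime improvement are roughly consistent with each other under a simple model. The sketch below assumes that PCM lifetime is inversely proportional to the number of writes reaching main memory and that those writes are all LLC writebacks; both are simplifications introduced here (they ignore wear leveling and uneven write distribution), and the function name is hypothetical. The slides do not say how the lifetime figure was derived.

```python
def lifetime_gain(writeback_reduction):
    """Fractional lifetime increase if lifetime scales as 1 / (number of writes).

    writeback_reduction is the fraction of writes removed, in [0, 1).
    """
    if not 0 <= writeback_reduction < 1:
        raise ValueError("writeback_reduction must be in [0, 1)")
    return 1.0 / (1.0 - writeback_reduction) - 1.0


if __name__ == "__main__":
    # 33% fewer writebacks, as reported in the summary.
    print(f"{lifetime_gain(0.33):.1%}")
```

With a reduction of 0.33 the model gives a gain just under 50%, close to the figure quoted in the summary.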
  26. 26. Conclusion. DRAM is hitting a scalability wall; new memories and architectures are proposed. We target PCM as main memory and propose ARI: Adaptive Replacement and Insertion, a simple scheme that reduces writebacks to main memory and boosts PCM performance and lifetime.
  27. 27. Thank you! Questions?
  28. 28. Backup Slides
  29. 29. Related Work: PCM. G. Dhiman et al., "PDRAM: A hybrid PRAM and DRAM main memory system," DAC '09; M. K. Qureshi et al., "Enhancing Lifetime and Security of PCM-based Main Memory with Start-Gap Wear Leveling," MICRO '09; B. C. Lee et al., "Architecting Phase Change Memory as a Scalable DRAM Alternative," ISCA '09; M. K. Qureshi et al., "Scalable high performance main memory system using phase-change memory technology," ISCA '09; A. P. Ferreira et al., "Increasing PCM main memory lifetime," DATE '10
  30. 30. Related Work: PCM. N. H. Seong et al., "Security refresh: prevent malicious wear-out and increase durability for phase-change memory with dynamically randomized address mapping," ISCA '10; H. Yoon et al., "Row buffer locality aware caching policies for hybrid memories," ICCD '12; Stuecheli et al., "The Virtual Write Queue: Coordinating DRAM and Last-Level Cache Policies," ISCA '10; M. K. Qureshi and G. H. Loh, "Fundamental latency trade-off in architecting DRAM caches: Outperforming impractical SRAM-tags with a simple and practical design," MICRO '12
  31. 31. ARI: Insertion impact
  32. 32. ARI: Total Memory Traffic. [Figure: Total memory traffic (Misses + Writebacks), normalized to LRU, for 4H16 and ARI; benchmarks: gcc, bzip, bwaves, mcf, milc, zeus, gromacs, cactusADM, leslie3d, namd, gobmk, soplex, hmmer, sjeng, GemsFDTD, h264ref, astar, sphinx3, avg]