Designing On-chip Memory Systems for Throughput Architectures
Ph.D. Proposal by Jeff Diamond. Advisor: Stephen Keckler

Transcript of the proposal slides:

  • Slide 1
  • Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler
  • Slide 2
  • Turning to Heterogeneous Chips: AMD Trinity, Intel Ivy Bridge, NVIDIA Tegra 3. "We'll be seeing a lot more than 2-4 cores per chip really quickly." (Bill Mark, 2005)
  • Slide 3
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies); Methodology; Proposed Work
  • Slide 4
  • Throughput Architectures (TA). Key features: use explicit parallelism to break the application into threads; optimize hardware for performance density, not single-thread performance. Benefits: dropping voltage and peak frequency gives a quadratic improvement in power efficiency; cores are smaller and more energy efficient (less need for out-of-order execution, register renaming, branch prediction, fast synchronization, or low-latency ALUs); further economies come from multithreading each core and amortizing expense using SIMD.
  • Slide 5
  • Scope: Highly Threaded TAs. The architecture continuum runs from multithreading (a large number of threads masks long latency; a small amount of cache, primarily for bandwidth) to caching (large amounts of cache reduce latency; a small number of threads). Can we get the benefits of both? Examples: POWER7, 4 threads/core, ~1MB/thread; SPARC T4, 8 threads/core, ~80KB/thread; GTX 580, 48 threads/core, ~2KB/thread.
  • Slide 6
  • Problem: Technology Mismatch. Computation is cheap, data movement is expensive: a hit in the L1 cache costs 2.5x the power of a 64-bit FMADD; a move across the chip, 50x; a fetch from DRAM, 320x. Exponential growth in cores saturates off-chip bandwidth, capping performance. Latency to off-chip DRAM is now hundreds of cycles, so hundreds of threads must be in flight to cover it.
  • Slide 7
  • The Downward Spiral. Little's Law: the number of threads needed is proportional to average latency (see the worked form below). On-chip resources carry an opportunity cost: thread contexts and in-flight memory accesses. Too many threads create negative feedback: adding threads to cover latency increases latency (slower register access and thread scheduling) and reduces locality, which reduces bandwidth and DRAM efficiency and the effectiveness of caching. The result is parallel starvation.
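A minimal worked form of this application of Little's Law, using the symbols defined on the modeling slides later in the deck (the numeric example is mine):

```latex
% Concurrency needed to hide latency (Little's Law):
\[
  N_T \;=\; P_{CHIP} \times L_{AVG}
\]
% Example: sustaining 1 operation per cycle against a 400-cycle
% average latency requires ~400 threads' worth of work in flight.
```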
  • Slide 8
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies); Methodology; Proposed Work
  • Slide 9
  • Goal: Increase Parallel Efficiency. The problem: too many threads! Increasing parallel efficiency means reducing the number of threads needed to achieve a given level of performance, which improves throughput performance. The approach: apply low-latency caches and leverage the upward spiral. It is difficult to mix multithreading and caching, so caches are typically used just for bandwidth amplification. Important factors: thread scheduling and instruction scheduling (per-thread parallelism).
  • Slide 10
  • Contributions: quantifying the impact of single-thread performance on throughput performance; developing a mathematical analysis of throughput performance; building a novel hybrid trace-based simulation infrastructure; demonstrating unique architectural enhancements in thread scheduling and cache policies.
  • Slide 11
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies); Methodology; Proposed Work
  • Slide 12
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 13
  • Mathematical Analysis. Why take a mathematical approach? To be very precise about what we want to optimize; to understand the relationships and sensitivities of throughput performance to single-thread performance, cache improvements, and application characteristics; and to suggest the most fruitful architectural improvements.
  • Slide 14
  • Modeling Throughput Performance. Definitions: N_T = total active threads; P_CHIP = total throughput performance; P_ST = single-thread performance; L_AVG = average latency per instruction; Power_CHIP = E_AVG (Joules) x P_CHIP. (The full model is assembled below.)
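The slide's equation image does not survive transcription; assembling its definitions into the implied model (the exact form shown on the slide is my reconstruction):

```latex
\[
  P_{ST} \propto \frac{1}{L_{AVG}}, \qquad
  P_{CHIP} \;=\; N_T \times P_{ST}, \qquad
  \mathrm{Power}_{CHIP} \;=\; E_{AVG} \times P_{CHIP}
\]
```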
  • Slide 15
  • Cache As A Performance Unit. Key question: how much cache does a thread need? FMADD: area equivalent to 2-11KB of SRAM or 8-40KB of eDRAM; shared through pipelining; active power 20pJ/op; leakage power 1 W/mm^2. Cache: active power 50pJ per L1 access, 1.1nJ per L2 access; leakage power 70 mW/mm^2 (SRAM ~1.4 W/MB, eDRAM ~350 mW/MB). Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm^2 than FPUs. One FPU = ~64KB SRAM / ~256KB eDRAM.
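A quick arithmetic check connecting these figures to the earlier "2.5x the power of a 64-bit FMADD" claim (the DRAM line is my extrapolation from the 320x figure):

```latex
\[
  \frac{E_{L1}}{E_{FMADD}} = \frac{50\,\mathrm{pJ}}{20\,\mathrm{pJ}} = 2.5\times,
  \qquad
  E_{DRAM} \approx 320 \times 20\,\mathrm{pJ} = 6.4\,\mathrm{nJ}
\]
```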
  • Slide 16
  • Performance From Caching. Ignore changes to DRAM latency and off-chip bandwidth (we will simulate these) and assume ideal caches: what is the maximum performance benefit? Definitions: A = arithmetic intensity of the application (fraction of non-memory instructions); memory intensity M = 1 - A; N_T = total active threads on chip; L = average latency per instruction. For power, replace L with E, the average energy per instruction: the result is qualitatively identical, but the differences are more dramatic. (One expansion of L is sketched below.)
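One way to expand L that is consistent with these definitions (the hit-rate split is my assumption; the slide's own equation is lost to transcription):

```latex
\[
  L(h) \;=\; A \cdot L_{ALU} \;+\; M \bigl( h \cdot L_{cache} + (1-h) \cdot L_{DRAM} \bigr)
\]
% h = cache hit rate; with an ideal cache, h is a function of the
% cache capacity per thread and the application's working set.
```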
  • Slide 17
  • Ideal Cache = Frequency Cache. Hit rate depends on the amount of cache and the application's working set. An ideal cache stores the items used the most times; this is the concept of frequency. Once we know an application's memory access characteristics, we can model its throughput performance (a sketch of this computation follows below).
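A minimal sketch of the frequency idealization, assuming the cache simply holds the most-referenced lines of the whole trace (so it is an upper bound on hit rate; the function names and trace are hypothetical):

```python
from collections import Counter

def ideal_frequency_hit_rate(trace, capacity_lines):
    """Upper-bound hit rate of an ideal 'frequency' cache: assume it
    holds exactly the capacity_lines most-referenced lines."""
    counts = Counter(trace)                      # line address -> reference count
    hottest = counts.most_common(capacity_lines) # lines an LFU-ideal cache keeps
    hits = sum(count for _, count in hottest)
    return hits / len(trace)

# Example: a tiny synthetic trace of cache-line addresses.
trace = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20, 0x40, 0x10]
print(ideal_frequency_hit_rate(trace, capacity_lines=2))  # 0.75
```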
  • Slide 18
  • Modeling Cache Performance (figure)
  • Slide 19
  • Performance Per Thread: P_S(t) is a steep reciprocal.
  • Slide 20
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 21
  • Valley in Cache Space (figure)
  • Slide 22
  • Valley Annotated (figure): no-cache vs. cache curves, showing the cache regime, the valley and its width, and the MT regime. (A toy numeric model follows below.)
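To make the valley concrete, a toy numeric model under assumed parameters (all numbers are hypothetical, and it ignores the off-chip bandwidth cap that eventually flattens the MT regime):

```python
def chip_throughput(n_threads, cache_mb=1.0, working_set_mb=0.5,
                    l_alu=1.0, l_cache=4.0, l_dram=400.0, a=0.7):
    """Toy model: chip throughput = threads / average latency, where
    the hit rate falls as more threads share a fixed-size cache."""
    per_thread_cache = cache_mb / n_threads
    hit = min(1.0, per_thread_cache / working_set_mb)  # crude hit-rate profile
    l_avg = a * l_alu + (1 - a) * (hit * l_cache + (1 - hit) * l_dram)
    return n_threads / l_avg

for n in (2, 8, 32, 128, 512, 2048):
    print(n, round(chip_throughput(n), 2))
# Throughput peaks at 2 threads (working set fits), collapses near 8
# (the valley), then climbs again as multithreading masks DRAM latency.
```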
  • Slide 23
  • Prior Work. Hong et al., 2009, 2010: simple, cacheless GPU models, used to predict the MT peak. Guz et al., 2008, 2010: graphed throughput performance with an assumed cache profile; identified the valley structure; validated against PARSEC benchmarks; but offered no mathematical analysis, didn't analyze the bandwidth-limited regime, and focused on CMP benchmarks. Galal et al., 2011: excellent mathematical analysis, but focused on FPU+register design.
  • Slide 24
  • Valley Annotated (figure): no-cache vs. cache curves, showing the cache regime, the valley and its width, and the MT regime.
  • Slide 25
  • Energy vs. Latency (figure; source: Bill Dally, IPDPS Keynote, 2011)
  • Slide 26
  • Valley Energy Efficiency (figure)
  • Slide 27
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 28
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 29
  • Thread Throttling. The hardware has real-time information: arithmetic intensity, bandwidth utilization, and current hit rate. It can match these to an approximate/conservative locality profile, approximate the optimum operating points, shut down or activate threads to increase performance, and concentrate power to overclock. (A controller sketch follows below.)
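A deliberately simplified controller sketch, assuming per-epoch hit-rate and bandwidth counters (the thresholds and greedy rule are my placeholders; the proposal's method instead uses the analytic model to choose between the two operating points):

```python
def throttle_step(active_threads, hit_rate, bw_utilization,
                  min_threads=32, max_threads=1024, step=32):
    """One control epoch: nudge the active thread count toward a point
    where neither the cache nor off-chip bandwidth is the bottleneck."""
    if bw_utilization > 0.95:
        # MT regime saturated: extra threads only destroy locality.
        return max(min_threads, active_threads - step)
    if hit_rate < 0.10:
        # Deep in the valley: per-thread cache is too small to help,
        # so back off toward the cache regime.
        return max(min_threads, active_threads - step)
    # Latency still coverable: add threads to mask it.
    return min(max_threads, active_threads + step)
```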
  • Slide 30
  • Prior Work. Several studies in the CMP and GPU areas scale back threads: CMPs when miss rates get too high, GPUs when off-chip bandwidth is saturated. These prior attempts are simple and unidirectional, whereas we have two operating points to hit across three different operating regimes. Mathematical analysis lets us approximate both points with as few as two samples, because both off-chip bandwidth and 1/hit-rate are nearly linear in thread count for a wide range of applications (see the sketch below).
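A sketch of the two-sample idea for the hit-rate side, assuming 1/hit-rate is linear in thread count (the sample values are hypothetical):

```python
def predict_hit_rate(samples, n_threads):
    """Fit 1/hit_rate as a linear function of thread count from two
    measured samples, then predict the hit rate at n_threads."""
    (n1, h1), (n2, h2) = samples
    slope = (1.0 / h2 - 1.0 / h1) / (n2 - n1)
    inv_h = 1.0 / h1 + slope * (n_threads - n1)
    return 1.0 / inv_h

samples = [(64, 0.50), (128, 0.33)]              # (threads, measured hit rate)
print(round(predict_hit_rate(samples, 256), 3))  # ~0.196
```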
  • Slide 31
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling; Cache Policies: Indexing, Replacement); Methodology; Proposed Work
  • Slide 32
  • Mathematical Analysis. The cache needs to behave like an LFU cache, which is hard to implement in practice. There is still very little cache per thread, and policies make big differences for small caches; associativity is a big issue. We cannot cache every line referenced: going beyond dead-line prediction, lines with lower reuse should be streamed.
  • Slide 33
  • Cache Conflict Misses. Different addresses map to the same way. Programmers prefer power-of-2 array sizes, so power-of-2 strides are pathological. A prime number of banks/sets was long thought ideal but had no efficient implementation, and the Mersenne numbers (3, 7, 15, 31, 63, 127, 255) are not so convenient. An early paper on prime strides for vector computers showed a 3x speedup, and Kharbutli (HPCA '04) showed that prime sets as a cache hash function worked well. Odd sets work just as well, admit the fastest implementation of DIV-MOD, and serve as a silver bullet, e.g., allowing banks with the same conflict rate. (See the indexing sketch below.)
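A sketch of one cheap odd-modulus indexing scheme, the classic folding trick for modulo 2^k - 1 (my illustration; the slides do not pin down the dissertation's exact hash):

```python
def mod_mersenne(x, k):
    """Compute x mod (2^k - 1) without division by repeatedly folding
    k-bit digits (works because 2^k is congruent to 1 mod 2^k - 1)."""
    m = (1 << k) - 1
    while x > m:
        x = (x & m) + (x >> k)
    return 0 if x == m else x

def set_index(line_addr, k=7):
    """Map a cache-line address onto 2^k - 1 sets (127 here), breaking
    up power-of-2 strides that would alias in a 2^k-set cache."""
    return mod_mersenne(line_addr, k)

# A stride of 128 lines hits a single set in a 128-set cache,
# but spreads across the 127 odd sets:
print({set_index(i * 128) for i in range(8)})  # 8 distinct sets
```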
  • Slide 34
  • Early Study Using PARSEC: PARSEC L2 with 64 threads (figure)
  • Slide 35
  • (Re)placement Policies. Not all data should be cached. Recent papers on LLC caches and hard-drive cache algorithms favor frequency over recency, but frequency is hard to implement; ARC is a good compromise. With direct mapping, replacement dominates, so we look for explicit approaches: priority classes and epochs. (A priority-insertion sketch follows below.)
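A minimal sketch of explicit priority-class placement, assuming software (or a classifier) tags each request with a priority; the class structure and eviction rule are my placeholders:

```python
class PriorityCache:
    """Sketch of priority-class placement: a line may only evict a
    victim of equal or lower priority, protecting high-reuse data
    from streaming data (all names hypothetical)."""

    def __init__(self, n_sets, n_ways):
        # each way holds (tag, priority); priority -1 marks an empty way
        self.sets = [[(None, -1)] * n_ways for _ in range(n_sets)]

    def access(self, set_idx, tag, priority):
        ways = self.sets[set_idx]
        for way_tag, _ in ways:
            if way_tag == tag:
                return True                        # hit
        # miss: candidate victim is the lowest-priority way
        victim = min(range(len(ways)), key=lambda w: ways[w][1])
        if ways[victim][1] <= priority:
            ways[victim] = (tag, priority)         # insert/replace
        # otherwise bypass: stream the line without caching it
        return False
```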
  • Slide 36
  • Prior Work. Belady solved the problem optimally but was light on implementation details; of his three hierarchies of methods, the best utilized information about prior line usage. Approximations include the ARC cache (ghost entries, recency and frequency groups), generational caches, multi-queue, and Qureshi's adaptive insertion policies (2006, 2007).
  • Slide 37
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling; Cache Policies: Indexing, Replacement); Methodology; Proposed Work
  • Slide 38
  • Benchmarks. Initially studied regular HPC kernels/applications in a CMP environment: dense matrix multiply, fast Fourier transform, and the HOMME weather simulation. Added CUDA throughput benchmarks: Parboil (old-school MPI style, coarse grained) and Rodinia (fine grained, varied); these are typical of historical GPGPU applications. Will add irregular benchmarks: sparse matrix multiply, adaptive finite elements, photon mapping.
  • Slide 39
  • Subset of Benchmarks (figure)
  • Slide 40
  • Preliminary Results. Most of the benchmarks should benefit: small working sets, concentrated working sets, and hit-rate curves that are easy to predict.
  • Slide 41
  • Typical Concentration of Locality (figure)
  • Slide 42
  • Scratchpad Locality (figure)
  • Slide 43
  • Hybrid Simulator Design. Goals: fast simulation; overcoming compiler issues for a reasonable base case; simulating a different architecture than the one traced. Pipeline: C++/CUDA source → NVCC → PTX intermediate → modified Ocelot functional simulator → custom trace module → custom simulator. The trace module produces an assembly listing, dynamic trace blocks with attachment points, and compressed trace data. (A hypothetical record format is sketched below.)
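One plausible shape for those dynamic trace records, purely as illustration (the slides do not specify the format; every field here is an assumption):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TraceBlock:
    """Hypothetical dynamic trace block: a straight-line run of
    instructions plus the memory it touched, tagged with an
    attachment point back into the static assembly listing."""
    attach_pc: int        # static assembly address the block attaches to
    warp_id: int          # which warp/thread group produced the block
    inst_count: int       # dynamic instructions in the block
    mem_addrs: List[int]  # cache-line addresses referenced, in order

def replay_hit_rate(blocks: List[TraceBlock], cache):
    """Replay blocks against the cache model of a *different*
    architecture than the one traced; cache.access(addr) -> bool."""
    hits = sum(cache.access(a) for b in blocks for a in b.mem_addrs)
    total = sum(len(b.mem_addrs) for b in blocks)
    return hits / max(total, 1)
```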
  • Slide 44
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling; Cache Policies: Indexing, Replacement); Methodology (Applications, Simulation); Proposed Work
  • Slide 45
  • Phase 1: HPC Applications. Looked at GEMM, FFT, and HOMME in a CMP setting; learned the implementation algorithms and alternative algorithms, and this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching: dense matrix multiply (blocking to maximize arithmetic intensity; enough contexts to cover latency), fast Fourier transform (pathologically hard on the memory system; communication and synchronization), HOMME weather modeling (intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation). First-author publications: PPoPP 2008, ISPASS 2011 (Best Paper).
  • Slide 46
  • Phase 2: Benchmark Characterization. Memory access characteristics of the Rodinia and Parboil benchmarks; apply the mathematical analysis; validate the model; find optimum operating points for the benchmarks; find the optimum TA topology for the benchmarks. NEARLY COMPLETE.
  • Slide 47
  • Phase 3: Evaluate Enhancements. Automatic thread throttling; a low-latency hierarchical cache; benefits of odd sets/odd banking; benefits of explicit placement (priority/epoch). NEEDS FINAL EVALUATION and the explicit placement study.
  • Slide 48
  • Final Phase: Extend the Domain. Study regular HPC applications in a throughput setting; add at least two irregular benchmarks (less likely to benefit from caching; new opportunities for enhancement); explore the impact of future TA topologies (memory cubes, TSV DRAM, etc.).
  • Slide 49
  • Proposed Timeline. Phase 1, HPC applications: completed. Phase 2, mathematical model and benchmark characterization: May-June. Phase 3, architectural enhancements: July-August. Phase 4, domain extension and new features: September-November.
  • Slide 50
  • Conclusion. Dissertation goals: quantify the degree to which single-thread performance affects throughput performance for an important class of applications; improve parallel efficiency through thread scheduling, cache topology, and cache policies. Feasibility: the regular benchmarks show promising memory behavior, and the cycle-accurate simulator is nearly complete.
  • Slide 51
  • Related Publications To Date
  • Slide 52
  • One Outlier (figure)
  • Slide 53
  • Priority Scheduling (figure)
  • Slide 54
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 55
  • Modeling Throughput Performance. Definitions: N_T = total active threads; P_CHIP = total throughput performance; P_ST = single-thread performance; L_AVG = average latency per instruction; Power_CHIP = E_AVG (Joules) x P_CHIP.
  • Slide 56
  • Phase 1: HPC Applications. Looked at GEMM, FFT, and HOMME in a CMP setting; learned the implementation algorithms and alternative algorithms, and this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching: dense matrix multiply (blocking to maximize arithmetic intensity; need enough contexts to cover latency), fast Fourier transform (pathologically hard on the memory system; communication and synchronization), HOMME weather modeling (intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation). Most significant publications: (list on slide).
  • Slide 57
  • Odd Banking - Scratchpad (figure)
  • Slide 58
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 59
  • Problem: Technology Mismatch. Computation is cheap, data movement is expensive. Exponential growth in cores saturates off-chip bandwidth, capping performance; latency to off-chip DRAM is now hundreds of cycles, requiring hundreds of threads per core to mask it. (Source: Bill Dally, IPDPS Keynote, 2011)
  • Slide 60
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 61
  • The Power Wall. Socket power is economically capped. DARPA's UHPC exascale initiative: supercomputers are now power capped, targeting 10-20x power efficiency by 2017. The supercomputing Moore's Law: double power efficiency every year. The post-PC client era requires >20x the power efficiency of the desktop. Even throughput architectures aren't efficient enough!
  • Slide 62
  • Short Latencies Also Matter (figure)
  • Slide 63
  • Importance of Scratchpad (figure)
  • Slide 64
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 65
  • Work Finished To Date. Mathematical analysis; architectural algorithms; benchmark characterization; a nearly finished full-chip simulator (currently simulates one core at a time); almost ready to publish 2 papers.
  • Slide 66
  • Benchmark Characterization (May-June). Latency sensitivity with cache feedback and multiple blocks per core; global caching and bandwidth across cores; validate the mathematical model against the benchmarks; compiler controls.
  • Slide 67
  • Architectural Evaluation (July-August). Priority thread scheduling; automatic thread throttling; optimized cache topology (low-latency fast path, odd-set banking, explicit epoch placement).
  • Slide 68
  • Extending the Domain (Sep-Nov). Extend the benchmarks: port HPC applications/kernels to the throughput environment; add at least two irregular applications (e.g., sparse MM, photon mapping, adaptive finite elements). Extend topologies and enhancements: explore the design space of emerging architectures; examine optimizations beneficial to irregular applications.
  • Slide 69
  • Questions?
  • Slide 70
  • Contributions. A mathematical analysis of throughput performance: caching, saturated bandwidth, and sensitivities to application characteristics and latency. Quantifying the importance of single-thread latency. Demonstrating novel enhancements: valley-based thread throttling, priority scheduling, and subcritical caching techniques.
  • Slide 71
  • HOMME (figure)
  • Slide 72
  • Dense Matrix Multiply (figure)
  • Slide 73
  • PARSEC L2 64KB Hit Rates (figure)
  • Slide 74
  • Odd Banking, L1 Cache Access (figure)
  • Slide 75
  • Local vs. Global Working Sets (figure)
  • Slide 76
  • Dynamic Working Sets (figure)
  • Slide 77
  • Fast Fourier Transform, blocked (figure)
  • Slide 78
  • Performance From Caching. Assume ideal caches; ignore changes to DRAM latency and off-chip bandwidth.