Designing On-chip Memory Systems for Throughput Architectures
Ph.D. Proposal by Jeff Diamond. Advisor: Stephen Keckler

Transcript of the proposal slides:

  • Slide 1
  • Designing On-chip Memory Systems for Throughput Architectures Ph.D. Proposal Jeff Diamond Advisor: Stephen Keckler
  • Slide 2
  • Turning to Heterogeneous Chips: AMD Trinity, Intel Ivy Bridge, NVIDIA Tegra 3. "We'll be seeing a lot more than 2-4 cores per chip really quickly." (Bill Mark, 2005)
  • Slide 3
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies); Methodology; Proposed Work
  • Slide 4
  • Throughput Architectures (TA). Key features: use explicit parallelism to break the application into threads; optimize hardware for performance density, not single-thread performance. Benefits: dropping voltage and peak frequency gives a quadratic improvement in power efficiency; cores are smaller and more energy efficient (less need for out-of-order execution, register renaming, branch prediction, fast synchronization, or low-latency ALUs); further economies come from multithreading each core and amortizing expense using SIMD.
  • Slide 5
  • Scope: Highly Threaded TAs. The architecture continuum runs from multithreading (a large number of threads masks long latency; a small amount of cache, primarily for bandwidth) to caching (large amounts of cache reduce latency; a small number of threads). Can we get the benefits of both? Examples: POWER7, 4 threads/core, ~1MB/thread; SPARC T4, 8 threads/core, ~80KB/thread; GTX 580, 48 threads/core, ~2KB/thread.
  • Slide 6
  • Problem: Technology Mismatch. Computation is cheap, data movement is expensive: a hit in the L1 cache costs 2.5x the power of a 64-bit FMADD; a move across the chip, 50x; a fetch from DRAM, 320x. Exponential growth in cores saturates off-chip bandwidth, capping performance. Latency to off-chip DRAM is now hundreds of cycles, so hundreds of threads must be in flight to cover it.
  • Slide 7
  • The Downward Spiral. Little's Law: the number of threads needed is proportional to average latency (see the worked form below). On-chip resources carry an opportunity cost: thread contexts and in-flight memory accesses. Too many threads create negative feedback: adding threads to cover latency increases latency (slower register access and thread scheduling) and reduces locality, which reduces bandwidth and DRAM efficiency and the effectiveness of caching. The result is parallel starvation.
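A minimal worked form of this application of Little's Law, using the symbols defined on the modeling slides later in the deck (the numeric example is mine):

```latex
% Concurrency needed to hide latency (Little's Law):
\[
  N_T \;=\; P_{CHIP} \times L_{AVG}
\]
% Example: sustaining 1 operation per cycle against a 400-cycle
% average latency requires ~400 threads' worth of work in flight.
```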
  • Slide 8
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies); Methodology; Proposed Work
  • Slide 9
  • Goal: Increase Parallel Efficiency. The problem: too many threads! Increasing parallel efficiency means reducing the number of threads needed to achieve a given level of performance, which improves throughput performance. The approach: apply low-latency caches and leverage the upward spiral. It is difficult to mix multithreading and caching, so caches are typically used just for bandwidth amplification. Important factors: thread scheduling and instruction scheduling (per-thread parallelism).
  • Slide 10
  • Contributions: quantifying the impact of single-thread performance on throughput performance; developing a mathematical analysis of throughput performance; building a novel hybrid trace-based simulation infrastructure; demonstrating unique architectural enhancements in thread scheduling and cache policies.
  • Slide 11
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies); Methodology; Proposed Work
  • Slide 12
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 13
  • Mathematical Analysis. Why take a mathematical approach? To be very precise about what we want to optimize; to understand the relationships and sensitivities of throughput performance to single-thread performance, cache improvements, and application characteristics; and to suggest the most fruitful architectural improvements.
  • Slide 14
  • Modeling Throughput Performance. Definitions: N_T = total active threads; P_CHIP = total throughput performance; P_ST = single-thread performance; L_AVG = average latency per instruction; Power_CHIP = E_AVG (Joules) x P_CHIP. (The full model is assembled below.)
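The slide's equation image does not survive transcription; assembling its definitions into the implied model (the exact form shown on the slide is my reconstruction):

```latex
\[
  P_{ST} \propto \frac{1}{L_{AVG}}, \qquad
  P_{CHIP} \;=\; N_T \times P_{ST}, \qquad
  \mathrm{Power}_{CHIP} \;=\; E_{AVG} \times P_{CHIP}
\]
```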
  • Slide 15
  • Cache As A Performance Unit. Key question: how much cache does a thread need? FMADD: area equivalent to 2-11KB of SRAM or 8-40KB of eDRAM; shared through pipelining; active power 20pJ/op; leakage power 1 W/mm^2. Cache: active power 50pJ per L1 access, 1.1nJ per L2 access; leakage power 70 mW/mm^2 (SRAM ~1.4 W/MB, eDRAM ~350 mW/MB). Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm^2 than FPUs. One FPU = ~64KB SRAM / ~256KB eDRAM.
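A quick arithmetic check connecting these figures to the earlier "2.5x the power of a 64-bit FMADD" claim (the DRAM line is my extrapolation from the 320x figure):

```latex
\[
  \frac{E_{L1}}{E_{FMADD}} = \frac{50\,\mathrm{pJ}}{20\,\mathrm{pJ}} = 2.5\times,
  \qquad
  E_{DRAM} \approx 320 \times 20\,\mathrm{pJ} = 6.4\,\mathrm{nJ}
\]
```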
  • Slide 16
  • Performance From Caching. Ignore changes to DRAM latency and off-chip bandwidth (we will simulate these) and assume ideal caches: what is the maximum performance benefit? Definitions: A = arithmetic intensity of the application (fraction of non-memory instructions); memory intensity M = 1 - A; N_T = total active threads on chip; L = average latency per instruction. For power, replace L with E, the average energy per instruction: the result is qualitatively identical, but the differences are more dramatic. (One expansion of L is sketched below.)
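One way to expand L that is consistent with these definitions (the hit-rate split is my assumption; the slide's own equation is lost to transcription):

```latex
\[
  L(h) \;=\; A \cdot L_{ALU} \;+\; M \bigl( h \cdot L_{cache} + (1-h) \cdot L_{DRAM} \bigr)
\]
% h = cache hit rate; with an ideal cache, h is a function of the
% cache capacity per thread and the application's working set.
```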
  • Slide 17
  • Ideal Cache = Frequency Cache. Hit rate depends on the amount of cache and the application's working set. An ideal cache stores the items used the most times; this is the concept of frequency. Once we know an application's memory access characteristics, we can model its throughput performance (a sketch of this computation follows below).
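A minimal sketch of the frequency idealization, assuming the cache simply holds the most-referenced lines of the whole trace (so it is an upper bound on hit rate; the function names and trace are hypothetical):

```python
from collections import Counter

def ideal_frequency_hit_rate(trace, capacity_lines):
    """Upper-bound hit rate of an ideal 'frequency' cache: assume it
    holds exactly the capacity_lines most-referenced lines."""
    counts = Counter(trace)                      # line address -> reference count
    hottest = counts.most_common(capacity_lines) # lines an LFU-ideal cache keeps
    hits = sum(count for _, count in hottest)
    return hits / len(trace)

# Example: a tiny synthetic trace of cache-line addresses.
trace = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20, 0x40, 0x10]
print(ideal_frequency_hit_rate(trace, capacity_lines=2))  # 0.75
```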
  • Slide 18
  • Modeling Cache Performance (figure)
  • Slide 19
  • Performance Per Thread: P_S(t) is a steep reciprocal.
  • Slide 20
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 21
  • Valley in Cache Space (figure)
  • Slide 22
  • Valley Annotated (figure): no-cache vs. cache curves, showing the cache regime, the valley and its width, and the MT regime. (A toy numeric model follows below.)
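To make the valley concrete, a toy numeric model under assumed parameters (all numbers are hypothetical, and it ignores the off-chip bandwidth cap that eventually flattens the MT regime):

```python
def chip_throughput(n_threads, cache_mb=1.0, working_set_mb=0.5,
                    l_alu=1.0, l_cache=4.0, l_dram=400.0, a=0.7):
    """Toy model: chip throughput = threads / average latency, where
    the hit rate falls as more threads share a fixed-size cache."""
    per_thread_cache = cache_mb / n_threads
    hit = min(1.0, per_thread_cache / working_set_mb)  # crude hit-rate profile
    l_avg = a * l_alu + (1 - a) * (hit * l_cache + (1 - hit) * l_dram)
    return n_threads / l_avg

for n in (2, 8, 32, 128, 512, 2048):
    print(n, round(chip_throughput(n), 2))
# Throughput peaks at 2 threads (working set fits), collapses near 8
# (the valley), then climbs again as multithreading masks DRAM latency.
```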
  • Slide 23
  • Prior Work. Hong et al., 2009, 2010: simple, cacheless GPU models, used to predict the MT peak. Guz et al., 2008, 2010: graphed throughput performance with an assumed cache profile; identified the valley structure; validated against PARSEC benchmarks; but offered no mathematical analysis, didn't analyze the bandwidth-limited regime, and focused on CMP benchmarks. Galal et al., 2011: excellent mathematical analysis, but focused on FPU+register design.
  • Slide 24
  • Valley Annotated (figure): no-cache vs. cache curves, showing the cache regime, the valley and its width, and the MT regime.
  • Slide 25
  • Energy vs. Latency (figure; source: Bill Dally, IPDPS Keynote, 2011)
  • Slide 26
  • Valley Energy Efficiency (figure)
  • Slide 27
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 28
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies); Methodology; Proposed Work
  • Slide 29
  • Thread Throttling. The hardware has real-time information: arithmetic intensity, bandwidth utilization, and current hit rate. It can match these to an approximate/conservative locality profile, approximate the optimum operating points, shut down or activate threads to increase performance, and concentrate power to overclock. (A controller sketch follows below.)
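A deliberately simplified controller sketch, assuming per-epoch hit-rate and bandwidth counters (the thresholds and greedy rule are my placeholders; the proposal's method instead uses the analytic model to choose between the two operating points):

```python
def throttle_step(active_threads, hit_rate, bw_utilization,
                  min_threads=32, max_threads=1024, step=32):
    """One control epoch: nudge the active thread count toward a point
    where neither the cache nor off-chip bandwidth is the bottleneck."""
    if bw_utilization > 0.95:
        # MT regime saturated: extra threads only destroy locality.
        return max(min_threads, active_threads - step)
    if hit_rate < 0.10:
        # Deep in the valley: per-thread cache is too small to help,
        # so back off toward the cache regime.
        return max(min_threads, active_threads - step)
    # Latency still coverable: add threads to mask it.
    return min(max_threads, active_threads + step)
```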
  • Slide 30
  • Prior Work. Several studies in the CMP and GPU areas scale back threads: CMPs when miss rates get too high, GPUs when off-chip bandwidth is saturated. These prior attempts are simple and unidirectional, whereas we have two operating points to hit across three different operating regimes. Mathematical analysis lets us approximate both points with as few as two samples, because both off-chip bandwidth and 1/hit-rate are nearly linear in thread count for a wide range of applications (see the sketch below).
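A sketch of the two-sample idea for the hit-rate side, assuming 1/hit-rate is linear in thread count (the sample values are hypothetical):

```python
def predict_hit_rate(samples, n_threads):
    """Fit 1/hit_rate as a linear function of thread count from two
    measured samples, then predict the hit rate at n_threads."""
    (n1, h1), (n2, h2) = samples
    slope = (1.0 / h2 - 1.0 / h1) / (n2 - n1)
    inv_h = 1.0 / h1 + slope * (n_threads - n1)
    return 1.0 / inv_h

samples = [(64, 0.50), (128, 0.33)]              # (threads, measured hit rate)
print(round(predict_hit_rate(samples, 256), 3))  # ~0.196
```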
  • Slide 31
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling; Cache Policies: Indexing, Replacement); Methodology; Proposed Work
  • Slide 32
  • Mathematical Analysis. The cache needs to behave like an LFU cache, which is hard to implement in practice. There is still very little cache per thread, and policies make big differences for small caches; associativity is a big issue. We cannot cache every line referenced: going beyond dead-line prediction, lines with lower reuse should be streamed.
  • Slide 33
  • Cache Conflict Misses. Different addresses map to the same way. Programmers prefer power-of-2 array sizes, so power-of-2 strides are pathological. A prime number of banks/sets was long thought ideal but had no efficient implementation, and the Mersenne numbers (3, 7, 15, 31, 63, 127, 255) are not so convenient. An early paper on prime strides for vector computers showed a 3x speedup, and Kharbutli (HPCA '04) showed that prime sets as a cache hash function worked well. Odd sets work just as well, admit the fastest implementation of DIV-MOD, and serve as a silver bullet, e.g., allowing banks with the same conflict rate. (See the indexing sketch below.)
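A sketch of one cheap odd-modulus indexing scheme, the classic folding trick for modulo 2^k - 1 (my illustration; the slides do not pin down the dissertation's exact hash):

```python
def mod_mersenne(x, k):
    """Compute x mod (2^k - 1) without division by repeatedly folding
    k-bit digits (works because 2^k is congruent to 1 mod 2^k - 1)."""
    m = (1 << k) - 1
    while x > m:
        x = (x & m) + (x >> k)
    return 0 if x == m else x

def set_index(line_addr, k=7):
    """Map a cache-line address onto 2^k - 1 sets (127 here), breaking
    up power-of-2 strides that would alias in a 2^k-set cache."""
    return mod_mersenne(line_addr, k)

# A stride of 128 lines hits a single set in a 128-set cache,
# but spreads across the 127 odd sets:
print({set_index(i * 128) for i in range(8)})  # 8 distinct sets
```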
  • Slide 34
  • Early Study Using PARSEC: PARSEC L2 with 64 threads (figure)
  • Slide 35
  • (Re)placement Policies. Not all data should be cached. Recent papers on LLC caches and hard-drive cache algorithms favor frequency over recency, but frequency is hard to implement; ARC is a good compromise. With direct mapping, replacement dominates, so we look for explicit approaches: priority classes and epochs. (A priority-insertion sketch follows below.)
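A minimal sketch of explicit priority-class placement, assuming software (or a classifier) tags each request with a priority; the class structure and eviction rule are my placeholders:

```python
class PriorityCache:
    """Sketch of priority-class placement: a line may only evict a
    victim of equal or lower priority, protecting high-reuse data
    from streaming data (all names hypothetical)."""

    def __init__(self, n_sets, n_ways):
        # each way holds (tag, priority); priority -1 marks an empty way
        self.sets = [[(None, -1)] * n_ways for _ in range(n_sets)]

    def access(self, set_idx, tag, priority):
        ways = self.sets[set_idx]
        for way_tag, _ in ways:
            if way_tag == tag:
                return True                        # hit
        # miss: candidate victim is the lowest-priority way
        victim = min(range(len(ways)), key=lambda w: ways[w][1])
        if ways[victim][1] <= priority:
            ways[victim] = (tag, priority)         # insert/replace
        # otherwise bypass: stream the line without caching it
        return False
```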
  • Slide 36
  • Prior Work. Belady solved the problem optimally but was light on implementation details; of his three hierarchies of methods, the best utilized information about prior line usage. Approximations include the ARC cache (ghost entries, recency and frequency groups), generational caches, multi-queue, and Qureshi's adaptive insertion policies (2006, 2007).
  • Slide 37
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling; Cache Policies: Indexing, Replacement); Methodology; Proposed Work
  • Slide 38
  • Benchmarks. Initially studied regular HPC kernels/applications in a CMP environment: dense matrix multiply, fast Fourier transform, and the HOMME weather simulation. Added CUDA throughput benchmarks: Parboil (old-school MPI style, coarse grained) and Rodinia (fine grained, varied); these are typical of historical GPGPU applications. Will add irregular benchmarks: sparse matrix multiply, adaptive finite elements, photon mapping.
  • Slide 39
  • Subset of Benchmarks (figure)
  • Slide 40
  • Preliminary Results. Most of the benchmarks should benefit: small working sets, concentrated working sets, and hit-rate curves that are easy to predict.
  • Slide 41
  • Typical Concentration of Locality (figure)
  • Slide 42
  • Scratchpad Locality (figure)
  • Slide 43
  • Hybrid Simulator Design. Goals: fast simulation; overcoming compiler issues for a reasonable base case; simulating a different architecture than the one traced. Pipeline: C++/CUDA source → NVCC → PTX intermediate → modified Ocelot functional simulator → custom trace module → custom simulator. The trace module produces an assembly listing, dynamic trace blocks with attachment points, and compressed trace data. (A hypothetical record format is sketched below.)
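One plausible shape for those dynamic trace records, purely as illustration (the slides do not specify the format; every field here is an assumption):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TraceBlock:
    """Hypothetical dynamic trace block: a straight-line run of
    instructions plus the memory it touched, tagged with an
    attachment point back into the static assembly listing."""
    attach_pc: int        # static assembly address the block attaches to
    warp_id: int          # which warp/thread group produced the block
    inst_count: int       # dynamic instructions in the block
    mem_addrs: List[int]  # cache-line addresses referenced, in order

def replay_hit_rate(blocks: List[TraceBlock], cache):
    """Replay blocks against the cache model of a *different*
    architecture than the one traced; cache.access(addr) -> bool."""
    hits = sum(cache.access(a) for b in blocks for a in b.mem_addrs)
    total = sum(len(b.mem_addrs) for b in blocks)
    return hits / max(total, 1)
```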
  • Slide 44
  • Talk Outline: Introduction; The Problem; Throughput Architectures; Dissertation Goals; The Solution; Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling; Cache Policies: Indexing, Replacement); Methodology (Applications, Simulation); Proposed Work
  • Slide 45
  • Phase 1: HPC Applications. Looked at GEMM, FFT, and HOMME in a CMP setting; learned the implementation algorithms and alternative algorithms, and this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching: dense matrix multiply (blocking to maximize arithmetic intensity; enough contexts to cover latency), fast Fourier transform (pathologically hard on the memory system; communication and synchronization), HOMME weather modeling (intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation). First-author publications: PPoPP 2008, ISPASS 2011 (Best Paper).
  • Slide 46
  • Phase 2: Benchmark Characterization. Memory access characteristics of the Rodinia and Parboil benchmarks; apply the mathematical analysis; validate the model; find optimum operating points for the benchmarks; find the optimum TA topology for the benchmarks. NEARLY COMPLETE.
  • Slide 47
  • Phase 3: Evaluate Enhancements. Automatic thread throttling; a low-latency hierarchical cache; benefits of odd sets/odd banking; benefits of explicit placement (priority/epoch). NEEDS FINAL EVALUATION and the explicit placement study.
  • Slide 48
  • Final Phase: Extend the Domain. Study regular HPC applications in a throughput setting; add at least two irregular benchmarks (less likely to benefit from caching; new opportunities for enhancement); explore the impact of future TA topologies (memory cubes, TSV DRAM, etc.).
  • Slide 49
  • Proposed Timeline. Phase 1, HPC applications: completed. Phase 2, mathematical model and benchmark characterization: May-June. Phase 3, architectural enhancements: July-August. Phase 4, domain extension and new features: September-November.
  • Slide 50
  • Conclusion. Dissertation goals: quantify the degree to which single-thread performance affects throughput performance for an important class of applications; improve parallel efficiency through thread scheduling, cache topology, and cache policies. Feasibility: the regular benchmarks show promising memory behavior, and the cycle-accurate simulator is nearly complete.
  • Slide 51
  • Related Publications To Date
  • Slide 52
  • One Outlier (figure)
  • Slide 53
  • Priority Scheduling (figure)
  • Slide 54
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 55
  • Modeling Throughput Performance. Definitions: N_T = total active threads; P_CHIP = total throughput performance; P_ST = single-thread performance; L_AVG = average latency per instruction; Power_CHIP = E_AVG (Joules) x P_CHIP.
  • Slide 56
  • Phase 1: HPC Applications. Looked at GEMM, FFT, and HOMME in a CMP setting; learned the implementation algorithms and alternative algorithms, and this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching: dense matrix multiply (blocking to maximize arithmetic intensity; need enough contexts to cover latency), fast Fourier transform (pathologically hard on the memory system; communication and synchronization), HOMME weather modeling (intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation). Most significant publications: (list on slide).
  • Slide 57
  • Odd Banking - Scratchpad (figure)
  • Slide 58
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 59
  • Problem: Technology Mismatch. Computation is cheap, data movement is expensive. Exponential growth in cores saturates off-chip bandwidth, capping performance; latency to off-chip DRAM is now hundreds of cycles, requiring hundreds of threads per core to mask it. (Source: Bill Dally, IPDPS Keynote, 2011)
  • Slide 60
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 61
  • The Power Wall. Socket power is economically capped. DARPA's UHPC exascale initiative: supercomputers are now power capped, targeting 10-20x power efficiency by 2017. The supercomputing Moore's Law: double power efficiency every year. The post-PC client era requires >20x the power efficiency of the desktop. Even throughput architectures aren't efficient enough!
  • Slide 62
  • Short Latencies Also Matter (figure)
  • Slide 63
  • Importance of Scratchpad (figure)
  • Slide 64
  • Talk Outline: Introduction; Throughput Architectures - The Problem; Dissertation Overview; Modeling Throughput Performance; Throughput Caches; The Valley; Methodology; Architectural Enhancements (Thread Scheduling; Cache Policies: Odd-set/Odd-bank Caches, Placement Policies; Cache Topology); Dissertation Timeline
  • Slide 65
  • Work Finished To Date. Mathematical analysis; architectural algorithms; benchmark characterization; a nearly finished full-chip simulator (currently simulates one core at a time); almost ready to publish 2 papers.
  • Slide 66
  • Benchmark Characterization (May-June). Latency sensitivity with cache feedback and multiple blocks per core; global caching and bandwidth across cores; validate the mathematical model against the benchmarks; compiler controls.
  • Slide 67
  • Architectural Evaluation (July-August). Priority thread scheduling; automatic thread throttling; optimized cache topology (low-latency fast path, odd-set banking, explicit epoch placement).
  • Slide 68
  • Extending the Domain (Sep-Nov). Extend the benchmarks: port HPC applications/kernels to the throughput environment; add at least two irregular applications (e.g., sparse MM, photon mapping, adaptive finite elements). Extend topologies and enhancements: explore the design space of emerging architectures; examine optimizations beneficial to irregular applications.
  • Slide 69
  • Questions?
  • Slide 70
  • Contributions. A mathematical analysis of throughput performance: caching, saturated bandwidth, and sensitivities to application characteristics and latency. Quantifying the importance of single-thread latency. Demonstrating novel enhancements: valley-based thread throttling, priority scheduling, and subcritical caching techniques.
  • Slide 71
  • HOMME (figure)
  • Slide 72
  • Dense Matrix Multiply (figure)
  • Slide 73
  • PARSEC L2 64KB Hit Rates (figure)
  • Slide 74
  • Odd Banking, L1 Cache Access (figure)
  • Slide 75
  • Local vs. Global Working Sets (figure)
  • Slide 76
  • Dynamic Working Sets (figure)
  • Slide 77
  • Fast Fourier Transform, blocked (figure)
  • Slide 78
  • Performance From Caching. Assume ideal caches; ignore changes to DRAM latency and off-chip bandwidth.