Designing On-chip Memory Systems for Throughput Architectures

Designing On-chip Memory Systems for Throughput Architectures. Ph.D. Proposal. Jeff Diamond. Advisor: Stephen Keckler.


Transcript of Designing On-chip Memory Systems for Throughput Architectures


Designing On-chip Memory Systems for Throughput Architectures
Ph.D. Proposal
Jeff Diamond
Advisor: Stephen Keckler
Turning to Heterogeneous Chips

AMD - TRINITY

Intel Ivy Bridge
"We'll be seeing a lot more than 2-4 cores per chip really quickly" (Bill Mark, 2005)

NVIDIA Tegra 3
Throughput is a more efficient way to accelerate parallel applications. Leverage the on-chip L3 cache to deal with the reduced bandwidth compared to dedicated GPUs. Can mention the power wall: supercomputers, sockets, and mobile are all power capped.

Punchline: we need throughput architectures, but even they need to be more power efficient.

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies)
  Methodology
  Proposed Work

After going through the outline points, reread the first slide (throughput).

Throughput Architectures (TA)
Key features: break a single application into threads; use explicit parallelism; optimize hardware for performance density, not single-thread performance.
Benefits: drop voltage and peak frequency for a quadratic improvement in power efficiency; cores are smaller and more energy efficient; further amortize through multithreading and SIMD; less need for out-of-order execution, register renaming, branch prediction, fast synchronization, and low-latency ALUs.
Note: performance per area, power per area, performance per watt. Scratchpad, like Godson-T. Some simplify the memory system.

Scope: Highly Threaded TA
Architecture continuum:
  Multithreading: a large number of threads masks long latency; a small amount of cache, primarily for bandwidth.
  Caching: large amounts of cache to reduce latency; a small number of threads.
Can we get the benefits of both?


POWER7: 4 threads/core, ~1MB/thread
SPARC T4: 8 threads/core, ~80KB/thread
GTX 580: 48 threads/core, ~2KB/thread
Note: CMP end vs. GPU end. On heterogeneous chips, they tend to pair cores from both ends of the spectrum. The initial part of this study dealt with the left side, but the rest of this study focuses on the right side. Agnostic to SIMD. So why not just add some cache? In a way, heterogeneous cores do this. It turns out that doesn't work so well; to understand why, we need to understand the issues with highly threaded TAs. Multithreading moves pressure from the cores to the memory system.

Problem: Technology Mismatch
Computation is cheap, data movement is expensive:
  Hit in L1 cache: 2.5x the power of a 64-bit FMADD
  Move across chip: 50x the power
  Fetch from DRAM: 320x the power
Limited off-chip bandwidth: exponential growth in cores saturates BW, so performance is capped.
DRAM latency is currently hundreds of cycles: need hundreds of threads per core in flight to cover DRAM latency.

The Downward Spiral
Little's Law: the number of threads needed is proportional to average latency.
Opportunity cost in on-chip resources: thread contexts, in-flight memory accesses.
Too many threads create negative feedback: adding threads to cover latency increases latency (slower register access, thread scheduling), reduces locality (which reduces bandwidth and DRAM efficiency and the effectiveness of caching), and leads to parallel starvation.
Note: when more threads increase latency, you need more threads to cover that latency.
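To make the Little's Law point concrete, a back-of-the-envelope example (the 400-cycle latency and one memory operation issued per cycle are illustrative numbers, not figures from the talk):

\[
N_{\text{in flight}} \;=\; \text{issue rate} \times L_{\text{AVG}}
\;\approx\; 1\ \tfrac{\text{mem op}}{\text{cycle}} \times 400\ \text{cycles}
\;=\; 400\ \text{operations (roughly, threads) per core}.
\]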

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance; Architectural Enhancements (Thread Scheduling, Cache Policies)
  Methodology
  Proposed Work

After going through the outline points, reread the first slide (throughput).

Goal: Increase Parallel Efficiency
Problem: too many threads! Increase parallel efficiency, i.e., the number of threads needed for a given level of performance; this improves throughput performance.
Apply low-latency caches and leverage the upward spiral. It is difficult to mix multithreading and caching; caches are typically used just for bandwidth amplification.
Important ancillary factors: thread scheduling and instruction scheduling (per-thread parallelism).
Note: also increase per-thread ILP.

Contributions
  Quantifying the impact of single-thread performance on throughput performance
  Developing a mathematical analysis of throughput performance
  Building a novel hybrid trace-based simulation infrastructure
  Demonstrating unique architectural enhancements in thread scheduling and cache policies
Note: but before we can do any of this, we have to understand throughput performance.

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies)
  Methodology
  Proposed Work

After going through the outline points, reread the first slide (throughput).

Mathematical Analysis
Why take a mathematical approach?
  Be very precise about what we want to optimize.
  Understand the relationships and sensitivities of throughput performance: single-thread performance, cache improvements, application characteristics.
  Rapid evaluation of the design space.
  Suggest the most fruitful architectural improvements.

One contribution of my dissertation.

Spare the details, just the basics. Show the model of throughput performance and the performance from caches. This is one of the key contributions of this work.

Modeling Throughput Performance


P_CHIP = total throughput performance
P_ST = single-thread performance
N_T = total active threads
L_AVG = average instruction latency
Power_CHIP = E_AVG (joules/instruction) x P_CHIP
How can caches help throughput performance? If measured per cycle, we get IPC per thread and latency in cycles; multiply by frequency (cycles/sec) to get true performance.

Key insight: latency varies with the number of threads. Latency is the most important factor, and we can use caching to reduce latency.
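A minimal sketch of how these quantities compose, assuming the simplest reading of the model (P_ST = 1/L_AVG and P_CHIP = N_T x P_ST); the struct and names are illustrative, not from the proposal:

// Minimal sketch of the throughput model (assumed form: P_ST = 1 / L_AVG,
// P_CHIP = N_T * P_ST).  Latency in cycles, energy in joules per instruction.
struct ThroughputModel {
    double avg_latency;   // L_AVG: average latency per instruction (cycles)
    double avg_energy;    // E_AVG: average energy per instruction (J)

    double single_thread_perf() const {          // P_ST, instructions per cycle
        return 1.0 / avg_latency;
    }
    double chip_perf(double n_threads) const {   // P_CHIP = N_T * P_ST
        return n_threads * single_thread_perf();
    }
    double chip_power(double n_threads) const {  // Power_CHIP = E_AVG * P_CHIP
        return avg_energy * chip_perf(n_threads); // J per cycle; multiply by clock
    }                                             // frequency to get watts
};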

Need to address power, which is performance x energy.

Cache As A Performance Unit

FMADD vs. SRAM area comparison: one FPU = 2-11KB of SRAM, or 8-40KB of eDRAM.
FPU active power: 20pJ/op; leakage power: 1 watt/mm^2.
SRAM active power: 50pJ per L1 access, 1.1nJ per L2 access; leakage power: 70 milliwatts/mm^2.
Key question: how much cache does a thread need?
Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm^2 than FPUs.
Leakage power comparison: one FPU = ~64KB SRAM / 256KB eDRAM.
STT-MRAM: a 2010 ISCA paper at 32nm, scaled to match 28nm NVIDIA numbers (2011 28nm Bill Dally keynote at IPDPS), gives 36pJ per access. It has this potential, but we need to model its benefits.
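As a quick sanity check, the leakage figures above are consistent with the 10-15x power-density claim:

\[
\frac{1\ \text{W/mm}^2\ (\text{FPU leakage})}{70\ \text{mW/mm}^2\ (\text{SRAM leakage})} \approx 14\times.
\]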

Performance From Caching
Ignore changes to DRAM latency and off-chip BW; we will simulate these.
Assume ideal caches (frequency). What is the maximum performance benefit?

Memory intensity: M = 1 - A

A = arithmetic intensity of the application (fraction of non-memory instructions)
N_T = total active threads on chip
L = latency
For power, replace L with E, the average energy per instruction. Qualitatively identical, but the differences are more dramatic.
Key insight: although the energy and latency formulations are qualitatively identical, only performance (latency) improvements increase parallel efficiency and avoid the downward spiral.
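One plausible way to write the average latency implied by these definitions (this exact form is an assumption; the slide's own equation is not reproduced in the transcript):

\[
L_{\text{AVG}} \;=\; A \cdot L_{\text{ALU}} \;+\; M \cdot \bigl(H \cdot L_{\text{cache}} + (1-H) \cdot L_{\text{DRAM}}\bigr), \qquad M = 1 - A,
\]

with per-thread performance 1/L_AVG as before; replacing each latency L with the corresponding energy E gives the power model.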

Ideal Cache = Frequency Cache
Hit rate depends on the amount of cache and the application working set. Store the items used the most times; this is the concept of frequency. Once we know an application's memory access characteristics, we can model throughput performance.

Modeling Cache Performance


F(c), H(c), P_ST(c). How do we describe this in terms of threads? Do talk about dividing up the cache.

First, describe the horizontal axis as representing the fraction of the entire application working set. After exposing the hit rate, describe how it is divided up, and why you so quickly find yourself at the leftmost side of the graph!

Integration of the frequency curve yields the hit rate of an ideal cache. Latency varies linearly with the miss rate, which is H(c) upside down. Performance is approximately 1 over the miss rate.
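In symbols, with f(x) the access-frequency profile over the working set (a sketch of the relationships just described, not the talk's exact notation):

\[
H(c) = \int_0^{c} f(x)\,dx, \qquad
L(c) \approx L_{\text{hit}} + \bigl(1 - H(c)\bigr)\,L_{\text{miss}}, \qquad
P_{ST}(c) \propto \frac{1}{1 - H(c)} \quad (L_{\text{miss}} \gg L_{\text{hit}}).
\]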

Cache Performance Per Thread
P_ST(t) is a steep reciprocal. Here we show how fast performance drops as a function of threads. Emphasize just how quickly we're at the leftmost edge of the hit-rate curve, that this is an ideal cache, and that there is so little cache per thread so fast.

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies)
  Methodology
  Proposed Work

After going through the outline points, reread the first slide (throughput).


The Valley in Cache Space (flat access)

[Figure: per-thread performance x active threads = High Performance]
Remember to describe the graph axes! Mention that the graph is not really symmetrical. So what happens if we graph throughput performance in terms of the number of active threads?

Let's look at total throughput performance in terms of cache per thread. The reason you have a small amount of cache per thread is that you have a lot of threads. The peaks are not always the same: J graphs or L graphs.

The Valley in Thread Space

[Figure labels: Cache Regime, MT Regime, Valley]

[Figure labels: Width, No Cache, Cache]
Talk about thread throttling in CMPs: never with a valley involved, never done with a TA. There is already evidence of impact. References: "Threads vs. Caches: Modeling the Behavior of Parallel Workloads," Zvika Guz, Oved Itzhak, Idit Keidar, Avinoam Kolodny, Avi Mendelson, and Uri C. Weiser, 2010; "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," 2008. We've uncovered a lot. We don't know which point is higher.
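A toy sweep that reproduces the valley shape under the simple model above; every constant and the square-root hit-rate curve are invented for illustration, not measured values from the proposal:

#include <cstdio>
#include <cmath>
#include <algorithm>

// Toy valley sweep: per-core throughput vs. active thread count under the
// simple latency model.  All constants here are illustrative, not measured.
int main() {
    const double cache_kb = 1024.0;   // total cache available to the core (KB)
    const double ws_kb    = 256.0;    // per-thread working set (KB)
    const double l_hit    = 4.0;      // cache hit latency (cycles)
    const double l_dram   = 400.0;    // DRAM latency (cycles)
    const double arith    = 0.7;      // arithmetic intensity A
    const double l_alu    = 1.0;      // ALU latency (cycles)

    for (int nt = 1; nt <= 2048; nt *= 2) {
        double per_thread_kb = cache_kb / nt;
        // Stand-in concave hit-rate curve H(c); a real profile would be measured.
        double h = std::min(1.0, std::sqrt(per_thread_kb / ws_kb));
        double l_mem = h * l_hit + (1.0 - h) * l_dram;
        double l_avg = arith * l_alu + (1.0 - arith) * l_mem;
        // N_T threads each keep one instruction in flight; cap at issue width 1.
        double ipc = std::min(1.0, nt / l_avg);
        std::printf("threads=%5d  hit=%.2f  IPC=%.2f\n", nt, h, ipc);
    }
    return 0;  // output shows a cache-regime peak, a valley, then an MT-regime peak
}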

Prior Work
  Hong et al., 2009, 2010: simple, cacheless GPU models; used to predict the MT peak.
  Guz et al., 2008, 2010: graphed throughput performance with an assumed cache profile; identified the valley structure; validated against PARSEC benchmarks; no mathematical analysis; minimal focus on the bandwidth-limited regime; CMP benchmarks.
  Galal et al., 2011: excellent mathematical analysis; focused on FPU + register design.

The Valley in Thread Space

[Figure labels: Cache Regime, MT Regime, Valley]

[Figure labels: Width, No Cache, Cache]


Energy vs. Latency

Need to simplify; compare with the latency numbers. Misquoted NVIDIA here.

Valley Energy Efficiency


It's not just performance that's falling off a cliff!

Point: the cache regime is the most energy efficient.

What's wrong with a high-latency cache? You will get a boost in peak performance by virtue of more threads running, but you won't improve parallel efficiency, so you won't get high hit rates.

Now we have hit-rate and BW information, and we know how to use them! We know the arithmetic intensity. This immediately indicates two key areas for solving the problem: dynamically finding the optimum operating points, and preserving as much of the cache peak as we can by making caches act more like ideal LFU caches.

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies)
  Methodology
  Proposed Work

After going through the outline points, reread the first slide (throughput).

Contribution: Thread Throttling
We have real-time information: arithmetic intensity, bandwidth utilization, current hit rate.
Conservatively approximate locality; approximate the optimum operating points.
Shut down / activate threads to increase performance. Concentrate power and overclock. Clock off unused cache if there is no benefit.

We will do two elements: thread throttling and thread scheduling.

Prior Work
Many studies in the CMP and GPU areas scale back threads: in CMPs when miss rates get too high, in GPUs when off-chip bandwidth is saturated. That target is simple to hit and unidirectional.
The valley is much more complex: two points to hit, three different operating regimes. Mathematical analysis lets us approximate both points with as few as two samples; both off-chip bandwidth and the reciprocal of the hit rate are nearly linear for a wide range of applications.

Finding Optimal Points

[Figure labels: Cache Regime, MT Regime, Valley]

[Figure labels: Width, No Cache, Cache]
First- and second-order approximations with a few points. Look at the arithmetic intensity: it gives a lower MT bound and an upper cache bound, so we can see whether the cache point could even possibly win. Once we measure the hit rate, we get a second-order approximation. We have lots of methods at our disposal; part of the research will be to see which methods are more robust. The key insight is that these methods rely on the properties of cache behavior, not application behavior, so it's critical that the caches are well behaved.
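A hedged sketch of the two-sample idea above: extrapolate 1/hit-rate and bandwidth utilization linearly in thread count, derive a cache-regime candidate and an MT-regime candidate, and leave it to the model to estimate which is higher. The thresholds and structure are assumptions for illustration, not the proposal's actual algorithm.

#include <algorithm>

// Counters observed at one operating point (illustrative fields).
struct Sample {
    double threads;         // active thread count when sampled
    double hit_rate;        // measured cache hit rate (0..1)
    double bw_utilization;  // fraction of off-chip bandwidth in use (0..1)
};

struct OperatingPoints {
    double cache_regime_threads;  // candidate near the cache peak
    double mt_regime_threads;     // candidate near the multithreading peak
};

// Linear extrapolation through two samples, y(x).
static double lerp(double x0, double y0, double x1, double y1, double x) {
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}

// Derive both candidate operating points from two runtime samples, assuming
// 1/hit_rate and bandwidth utilization are roughly linear in thread count.
OperatingPoints find_candidates(Sample a, Sample b, double max_threads) {
    OperatingPoints op{a.threads, max_threads};
    for (double t = a.threads; t <= max_threads; t += 1.0) {
        double inv_h = lerp(a.threads, 1.0 / a.hit_rate,
                            b.threads, 1.0 / b.hit_rate, t);
        if (inv_h > 2.0) break;        // assumed cutoff: hit rate falls below ~50%
        op.cache_regime_threads = t;
    }
    for (double t = a.threads; t <= max_threads; t += 1.0) {
        double bw = lerp(a.threads, a.bw_utilization,
                         b.threads, b.bw_utilization, t);
        if (bw >= 1.0) {               // predicted bandwidth saturation point
            op.mt_regime_threads = t;
            break;
        }
    }
    // The throughput model then estimates performance at both candidates and
    // picks the higher one; that comparison is omitted here.
    return op;
}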


Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies (indexing, replacement))
  Methodology
  Proposed Work

After going through the outline points, reread the first slide (throughput).

From the Mathematical Analysis
The cache needs to work like an LFU cache, which is hard to implement in practice. There is still very little cache per thread; policies make big differences for small caches, and associativity is a big issue for small caches. We cannot cache every line referenced: beyond dead-line prediction, stream lines with lower reuse.

Contribution: Odd-Set Indexing
Conflict misses are a pathological issue, most often seen with power-of-2 strides.
Idea: map to 2^N - 1 sets/banks instead.
A true silver bullet: it virtually eliminates conflict misses in every setting we've tried, and reduced scratchpad banks from 32 to 7 at the same level of bank conflicts.
Fastest, most efficient implementation: adds just a few gate delays; logic area is less than 4% of a 32-bit integer multiplier; can still access the last bank.

Point: it always makes up for losing that last cache line.
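A software sketch of how a 2^N - 1 set index can be computed with shift-and-add folding; the hardware version would be a small adder tree, and this routine and its names are illustrative, not the actual design:

#include <cstdint>

// Odd-set index: line_addr mod (2^N - 1), computed by folding N-bit chunks
// and summing them (each chunk sum is congruent to the address mod 2^N - 1).
uint32_t odd_set_index(uint64_t line_addr, unsigned n_bits) {
    const uint32_t mod = (1u << n_bits) - 1;   // 2^N - 1 sets, e.g. 7, 31, 63
    uint32_t acc = 0;
    while (line_addr != 0) {                   // fold the address into N-bit chunks
        acc += static_cast<uint32_t>(line_addr & mod);
        line_addr >>= n_bits;
    }
    while (acc > mod)                          // reduce the accumulated carries
        acc = (acc & mod) + (acc >> n_bits);
    if (acc == mod) acc = 0;                   // value 2^N - 1 wraps to set 0
    return acc;
}

With n_bits = 3 this maps addresses onto 7 banks, matching the 32-to-7 scratchpad example above.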

More Preliminary Results
PARSEC L2 with 64 threads. [Figure: hit rates for a fully associative cache vs. odd-set (1 bank) vs. direct mapped (1 bank).]
The mathematical model shows us that low hit rates are important.

Prior Work
A prime number of banks/sets was thought ideal, but there was no efficient implementation, and Mersenne primes are not so convenient (2^N - 1 gives 3, 7, 15, 31, 63, 127, 255, not all of which are prime); we demonstrated that this wasn't an issue.
Yang, ISCA '92: prime strides for vector computers; showed a 3x speedup. We get the correct offset for free.
Kharbutli, HPCA '04: showed that prime sets as a hash function for caches worked well. Our implementation is faster and has more features; they couldn't appreciate the benefits on SPEC.

(Re)placement Policies
Not all data should be cached. There are recent papers for LLC caches and hard-drive cache algorithms. Frequency wins over recency, but frequency is hard to implement; ARC is a good compromise. With direct mapping, replacement dominates, so we look for explicit approaches: priority classes and epochs (a hedged sketch follows below).
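Since the slide only names the ideas, here is a hedged sketch of what explicit priority-class placement could look like on the insertion path; the class encoding and the rule that lower-priority lines bypass the cache are assumptions for illustration, not the proposal's design:

#include <cstdint>

// Hypothetical priority classes attached to memory instructions (e.g. by the
// compiler or by PC-based prediction); the encoding is illustrative.
enum class Priority : uint8_t { Stream = 0, Normal = 1, Pinned = 2 };

struct Line { uint64_t tag; Priority prio; bool valid; };

// Insertion sketch: a new line may only evict a line of equal or lower
// priority; streaming lines bypass the cache entirely.
bool try_insert(Line* set, int ways, uint64_t tag, Priority prio) {
    if (prio == Priority::Stream) return false;       // stream data: do not cache
    int victim = -1;
    for (int w = 0; w < ways; ++w) {
        if (!set[w].valid) { victim = w; break; }      // free way
        if (set[w].prio <= prio &&
            (victim < 0 || set[w].prio < set[victim].prio))
            victim = w;                                // lowest-priority candidate
    }
    if (victim < 0) return false;                      // every resident line outranks us
    set[victim] = Line{tag, prio, true};
    return true;
}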

We have done very little preliminary analysis, but we believe the simple approach can work because of the PC-based methods above. REWRITE TO MAKE EXPLICIT.

Prior Work
  Belady solved it all: three hierarchies of methods; the best utilized information about prior line usage; light on implementation details.
  Approximations: Hallnor & Reinhardt, ISCA 2000, generational replacement; Megiddo, USENIX 2003, the ARC cache (ghost entries, recency and frequency groups); Qureshi, 2006, 2007, adaptive insertion policies; multi-queue, LRU-K, D-NUCA, etc.
Note: if you can't do it all, use 2.

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies (indexing, replacement))
  Methodology (Applications, Simulation)
  Proposed Work

After going through the outline points, reread the first slide (throughput).

Benchmarks
Initially studied regular HPC kernels/applications in a CMP environment: dense matrix multiply, fast Fourier transform, and the HOMME weather simulation.
Added CUDA throughput benchmarks: Parboil (old-school MPI, coarse grained) and Rodinia (fine grained, varied). These benchmarks are typical of historical GPGPU applications.
Will add irregular benchmarks: sparse matrix multiply, adaptive finite elements, photon mapping.

Used the largest input sets.

Subset of Benchmarks
For the cache analysis, we chose the 6 benchmarks with the highest memory intensity. We will add 4 more with medium intensity once we compensate for the compiler.

Preliminary Results
Most of the benchmarks should benefit: small working sets, concentrated working sets, hit-rate curves that are easy to predict.

Typical Concentration of Locality

People thought there'd be no reuse, but they were wrong. Point out that these are actually steps (blocks, if you will), not really linear changes.

Scratchpad Task Locality

Golden addresses; will redraw as a stacked graph.

Hybrid Simulator Design
Flow: C++/CUDA source, compiled by NVCC to the PTX intermediate form, run through the Ocelot functional simulator (modified), feeding a custom simulator.
Goals: fast simulation; overcome compiler issues to get a reasonable base case.
Custom trace module: assembly listing, dynamic trace blocks, attachment points, compressed trace data (a hedged sketch of a trace record follows).
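A hedged guess at what a dynamic trace block record might contain, just to make the "trace once, simulate a different architecture" idea concrete; the fields are invented for illustration and are not the proposal's actual format:

#include <cstdint>
#include <vector>

// Illustrative record for one dynamic trace block captured by the functional
// simulator.  A timing simulator can replay these blocks under a different
// cache/thread configuration than the one they were traced on.
struct TraceBlock {
    uint32_t thread_id;                  // CUDA thread (or warp) that executed it
    uint64_t start_pc;                   // attachment point in the assembly listing
    uint32_t instruction_count;          // non-memory work in the block
    std::vector<uint64_t> load_addrs;    // addresses touched, for cache simulation
    std::vector<uint64_t> store_addrs;
};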

Simulate a different architecture than the one traced. This is another contribution: we can simulate a different architecture than we trace.

Talk Outline
  Introduction: The Problem; Throughput Architectures; Dissertation Goals
  The Solution: Modeling Throughput Performance (Cache Performance, The Valley); Architectural Enhancements (Thread Throttling, Cache Policies (indexing, replacement))
  Methodology (Applications, Simulation)
  Proposed Work

After going through the outline points, reread the first slide (throughput).

Phase 1: HPC Applications
Looked at GEMM, FFT, and HOMME in a CMP setting. Learned the implementation algorithms and alternative algorithms; this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching:
  Dense matrix multiply: blocking to maximize arithmetic intensity; enough contexts to cover latency.
  Fast Fourier transform: pathologically hard on the memory system; communication and synchronization.
  HOMME weather modeling: intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation.
First-author publications: PPoPP 2008, ISPASS 2011 (Best Paper).

Phase 2: Benchmark Characterization
Memory access characteristics of the Rodinia and Parboil benchmarks. Apply the mathematical analysis; validate the model; find the optimum operating points for the benchmarks; find the optimum TA topology for the benchmarks. NEARLY COMPLETE.

Phase 3: Evaluate Enhancements
Automatic thread throttling; low-latency hierarchical cache; benefits of odd sets / odd banking; benefits of explicit placement (priority/epoch). NEEDS FINAL EVALUATION and the explicit placement study.

Final Phase: Extend the Domain
Study regular HPC applications in a throughput setting. Add at least two irregular benchmarks, which are less likely to benefit from caching and offer new opportunities for enhancement. Explore the impact of future TA topologies: memory cubes, TSV DRAM, etc.

Conclusion
Dissertation goals: quantify the degree to which single-thread performance affects throughput performance for an important class of applications; improve parallel efficiency through thread scheduling, cache topology, and cache policies.
Feasibility: regular benchmarks show promising memory behavior; the cycle-accurate simulator is nearly complete.

Proposed Timeline
  Phase 1: HPC applications - completed
  Phase 2: Mathematical model and benchmark characterization - May-June
  Phase 3: Architectural enhancements - July-August
  Phase 4: Domain enhancement / new features - September-November

Related Publications To Date

Any Questions?

One Outlier

Priority Scheduling

Talk Outline
  Introduction
  Throughput Architectures - The Problem
  Dissertation Overview
  Modeling Throughput Performance: Throughput; Caches; The Valley
  Methodology
  Architectural Enhancements: Thread Scheduling; Cache Policies (Odd-set/Odd-bank caches, Placement Policies, Cache Topology)
  Dissertation Timeline

Small caches have severe issues.

Modeling Throughput Performance


N_T = total active threads
P_CHIP = total throughput performance
P_ST = single-thread performance
L_AVG = average latency per instruction
Power_CHIP = E_AVG (joules) x P_CHIP
If measured per cycle, we get IPC per thread and latency in cycles; multiply by frequency (cycles/sec) to get true performance.

Key insight: latency varies with the number of threads. Latency is the most important factor, and we can use caching to reduce latency.

Need to address power, which is performance x energy.

Phase 1: HPC Applications
Looked at GEMM, FFT, and HOMME in a CMP setting. Learned the implementation algorithms and alternative algorithms; this expertise allows for credible throughput analysis. Valuable lessons in multithreading and caching:
  Dense matrix multiply: blocking to maximize arithmetic intensity; need enough contexts to cover latency.
  Fast Fourier transform: pathologically hard on the memory system; communication and synchronization.
  HOMME weather modeling: intra-chip scaling is incredibly difficult; memory system performance variation; replacing data movement with computation.
Most significant publications:

Odd Banking - Scratchpad

Talk Outline
  Introduction
  Throughput Architectures - The Problem
  Dissertation Overview
  Modeling Throughput Performance: Throughput; Caches; The Valley
  Methodology
  Architectural Enhancements: Thread Scheduling; Cache Policies (Odd-set/Odd-bank caches, Placement Policies, Cache Topology)
  Dissertation Timeline

Small caches have severe issues.

Problem: Technology Mismatch
Computation is cheap, data movement is expensive:
  Exponential growth in cores saturates off-chip bandwidth, so performance is capped.
  Latency to off-chip DRAM is now hundreds of cycles; need hundreds of threads per core to mask it.

* Bill Dally, IPDPS keynote, 2011: still communicating across the perimeter, transfer rates grow slowly, and the ratio of BW/flop is worse in 2017.

Talk Outline
  Introduction
  Throughput Architectures - The Problem
  Dissertation Overview
  Modeling Throughput Performance: Throughput; Caches; The Valley
  Methodology
  Architectural Enhancements: Thread Scheduling; Cache Policies (Odd-set/Odd-bank caches, Placement Policies, Cache Topology)
  Dissertation Timeline

Small caches have severe issues.

The Power Wall
Socket power is economically capped. DARPA's UHPC exascale initiative: supercomputers are now power capped; 10-20x power efficiency needed by 2017. Supercomputing Moore's Law: double power efficiency every year. The post-PC client era requires more than 20x the power efficiency of the desktop. Even throughput architectures aren't efficient enough!

Short Latencies Also Matter
Does not include the feedback from caching, which would amplify the impact of scheduling. Note the inversions: 5-6x throughput performance from latency, 2x throughput performance from scheduling.

Importance of Scratchpad

Talk Outline
  Introduction
  Throughput Architectures - The Problem
  Dissertation Overview
  Modeling Throughput Performance: Throughput; Caches; The Valley
  Methodology
  Architectural Enhancements: Thread Scheduling; Cache Policies (Odd-set/Odd-bank caches, Placement Policies, Cache Topology)
  Dissertation Timeline

Small caches have severe issues.

Work Finished To Date
Mathematical analysis; architectural algorithms; benchmark characterization; nearly finished full-chip simulator (currently simulates one core at a time).

Almost ready to publish 2 papers.

Benchmark Characterization (May-June)
Latency sensitivity with cache feedback and multiple blocks per core; global caching and bandwidth across cores; validate the mathematical model against the benchmarks; compiler controls.

Architectural Evaluation (July-August)
Priority thread scheduling; automatic thread throttling; optimized cache topology (low latency / fast path, odd-set banking, explicit epoch placement).

These 2 papers should not take 4 months to do; the schedule is very conservative, just going by paper deadlines.

Extending the Domain (Sep-Nov)
Extend the benchmarks: port HPC applications/kernels to the throughput environment; add at least two irregular applications (e.g., sparse MM, photon mapping, adaptive finite elements). Extend topologies and enhancements: explore the design space of emerging architectures; examine optimizations beneficial to irregular applications. Will likely start earlier.

Questions?

Contributions
  Mathematical analysis of throughput performance: caching, saturated bandwidth, sensitivities to application characteristics, latency
  Quantify the importance of single-thread latency
  Demonstrate novel enhancements: valley-based thread throttling, priority scheduling, subcritical caching techniques

Additional backup slides: HOMME; Dense Matrix Multiply; PARSEC L2 64KB Hit Rates; Odd Banking, L1 Cache Access; Local vs. Global Working Sets; Dynamic Working Sets

Fast Fourier Transform (blocked)

Performance From Caching
Assume ideal caches; ignore changes to DRAM latency and off-chip BW.
