Power Management for Memory Systems. Ming Chen, Nov. 10th, 2009. ECE 692 Topic Presentation.

  • Slide 1
  • Power Management for Memory Systems. Ming Chen, Nov. 10th, 2009. ECE 692 Topic Presentation.
  • Slide 2
  • Why Power Control for Main Memory? Memory capacities have been increasing significantly to accommodate chip multiprocessors (CMPs). The CPU is no longer the only major power consumer. Memory is highly under-utilized. [Table residue: "Requirement on amount" and "Requirement on bandwidth": 85.4%, 92.2%; "CPU & memory" and "CPU : memory": 1.21, 0.69]
  • Slide 3
  • Limiting the Power Consumption of Main Memory. Bruno Diniz, Dorgival Guedes, Wagner Meira Jr. (Federal University of Minas Gerais, Brazil); Ricardo Bianchini (Rutgers University, USA). Acknowledgments: The organization, order, and contents of some slides are based on Ricardo Bianchini's slides.
  • Slide 4
  • Power Saving vs. Power Control: a problem with two sides, both trading off power against performance. Power saving: guarantee performance first, then minimize power. Performance is primary; the aim is a lower energy bill. Power control (power capping, driven by cooling, thermal, and packaging limits): guarantee the power budget first, then maximize performance. The power budget is primary; the aim is to avoid system failures and thermal violations.
  • Slide 5
  • What is This Paper About? Much work has been done on saving power; this is the first paper I have read on power control for memory. It proposes four policies for Power Limiting (PL) in memory: Knapsack, LRU-Greedy, LRU-Smooth, and LRU-Ordered. It combines Power Limiting with Energy Conservation (PL-EC) and also provides a performance guarantee (PL-EC-Perf). An interesting paper that brings the two sides of the power problem together.
  • Slide 6
  • Power Actuators: RDRAM systems (DRAM, but not DDR SDRAM). Each chip can be transitioned independently between different power states.
  • Slide 7
  • Power Limiting. [Diagram: memory controller with chip1 to chip4; chip states shown as A, S, N, S]
  • Slide 8
  • Power Limiting. [Diagram: memory controller with chip1 to chip4 and an incoming access; chip states shown as A, S, N, S]
  • Slide 9
  • Power Limiting. [Diagram: memory controller with chip1 to chip4; chip states shown as A, S, N, S]
  • Slide 10
  • Adjusting Power States. [Diagram: memory controller with chip1 to chip4; chip states shown as S, N, S, N]
  • Slide 11
  • Power Limiting. [Diagram: memory controller with chip1 to chip4 and an incoming access; chip states shown as S, S, A, N] There are different approaches to adjusting the states.
  • Slide 12
  • Knapsack: Key Idea. State assignment is cast as a Multi-Choice Knapsack Problem (MCKP). Object: a memory device. Choices: its multiple (power) states. Weight: the power consumption. Cost: the overhead of transitioning to the active state. Goal: put each object in one state so that the total cost is minimized subject to the constraint on total weight. MCKP is NP-hard, so it is solved off-line.
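As a rough illustration of the MCKP formulation on this slide, the sketch below enumerates every state assignment for a handful of devices and keeps the cheapest one that fits the budget. The state names, power figures, and wake-up costs in `STATES` are hypothetical values chosen for the example, not numbers from the paper, and exhaustive search is only a stand-in for whatever off-line solver the authors used.

```python
from itertools import product

# Hypothetical per-state figures (illustrative only, not taken from the paper):
# state name -> (weight = power in mW, cost = delay in ns to return to active)
STATES = {
    "active": (300, 0),
    "standby": (180, 10),
    "nap": (30, 100),
    "powerdown": (3, 9000),
}


def solve_mckp(num_devices, budget_mw):
    """Pick one state per device, minimizing total cost with total power <= budget.

    Returns (total_cost, assignment) or None if no assignment fits the budget.
    Exhaustive search: exponential in num_devices, fine only for tiny instances.
    """
    best = None
    for assignment in product(STATES, repeat=num_devices):
        power = sum(STATES[s][0] for s in assignment)
        if power > budget_mw:
            continue  # violates the weight (power) constraint
        cost = sum(STATES[s][1] for s in assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best
```

With four devices and a 700 mW budget, this toy instance settles on one active device, two in standby, and one in nap, which is the kind of per-budget answer that could be precomputed and stored.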
  • Slide 13
  • Knapsack: An example. [Figure: chip # vs. power state] An LRU queue is maintained for the active devices. The LRU device is the victim. Switch the power states of the two chips.
  • Slide 14
  • LRU-Greedy: An LRU queue holds all devices. When a device is to be accessed, move it to the tail, then: Active? Go on. Not active? Put the LRU device into the shallowest state. Budget still not met? Adjust the state of the next device.
  • Slide 15
  • LRU-Smooth: An LRU queue holds all devices. When a device is to be accessed, move it to the tail, then: Active? Go on. Not active? Put the LRU device into the next lower-power state. Budget still not met? Adjust the state of the next device.
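The difference between LRU-Greedy and LRU-Smooth can be sketched as follows. This is one possible reading of the two slides, not the paper's implementation: Greedy is assumed to push the LRU device as deep as needed (the shallowest state that meets the budget) before touching the next device, while Smooth lowers each device by a single state in LRU order. The power values in `POWER_MW` and the function names are hypothetical.

```python
from collections import OrderedDict

# Hypothetical power per state in mW; index 0 is active, higher index is deeper.
POWER_MW = [300, 180, 30, 3]
DEEPEST = len(POWER_MW) - 1


def total_power(states):
    return sum(POWER_MW[s] for s in states.values())


def access(states, dev, budget_mw, policy):
    """Activate `dev` and demote other devices until the budget is met.

    states: OrderedDict mapping device -> state index, ordered LRU first.
    policy: "greedy" or "smooth" (assumed interpretations, see lead-in).
    """
    if policy not in ("greedy", "smooth"):
        raise ValueError("unknown policy")
    states.move_to_end(dev)  # accessed device becomes MRU (tail of the queue)
    states[dev] = 0          # it must be active to serve the access
    while total_power(states) > budget_mw:
        progressed = False
        for victim in list(states):  # walk from the LRU end
            if victim == dev or states[victim] == DEEPEST:
                continue
            states[victim] += 1  # one state deeper
            if policy == "greedy":
                # keep deepening this victim until the budget is met
                while states[victim] < DEEPEST and total_power(states) > budget_mw:
                    states[victim] += 1
            progressed = True
            if total_power(states) <= budget_mw:
                break
        if not progressed:
            raise ValueError("budget cannot be met")
    return states
```

Starting from chips in states active, standby, nap, standby and a 700 mW budget, an access to the napping chip makes Greedy send the LRU chip straight to nap, while Smooth spreads the demotion over the two least recently used chips.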
  • Slide 16
  • LRU-Ordered: Key Idea. An LRU queue holds the active devices; an ordered queue holds the devices in low-power modes (shallowest first). When a device is to be accessed: Active? Move it to the tail of the LRU queue and go on. Not active? Move it from the ordered queue to the tail of the LRU queue, and put the LRU device at the top of the ordered queue. Budget still not met? Move the next device in the ordered queue to the next lower-power state.
  • Slide 17
  • LRU-Ordered: An example
  • Slide 18
  • Energy Conservation (PL-EC): If the idle time in the current state exceeds the break-even time, move to a lower power state. Knapsack is used to minimize delay × power² for each possible number of devices in the active state. The number of active devices changes when an active device is transitioned to the next low-power state because its threshold expires, or when a low-power device is transitioned to the active state without violating the budget. Whenever that number is about to change, the memory controller looks up the table and adjusts the states. If activating the device would violate the budget, the basic scheme (PL) is used.
  • Slide 19
  • Performance Guarantee (PL-EC-Perf): To what extent should energy be saved? The basic strategy is from Xiaodong Li's ASPLOS'04 paper. Execution is divided into 5M-cycle epochs, with a user-defined slowdown (3%) relative to PL. Slack is computed at runtime; if slack < 0, EC is disabled until the end of the epoch. Disabling EC means reverting to the corresponding PL policy.
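The epoch-and-slack mechanism can be sketched roughly as below. The slide does not say how slack is computed, so the formula here is an assumption: slack is taken as the allowed extra delay (slowdown times the time the run would have taken under PL alone) minus the delay actually attributed to energy conservation. The class and method names are hypothetical.

```python
class SlackTracker:
    """Tracks slack within an epoch and disables EC when it goes negative."""

    def __init__(self, epoch_cycles=5_000_000, slowdown=0.03):
        self.epoch_cycles = epoch_cycles
        self.slowdown = slowdown
        self.reset()

    def reset(self):
        """Start a new epoch with EC enabled again."""
        self.elapsed = 0      # cycles executed in this epoch
        self.ec_delay = 0     # cycles of extra delay attributed to EC
        self.ec_enabled = True

    def slack(self):
        # Assumed definition: allowed delay minus delay already incurred.
        baseline = self.elapsed - self.ec_delay
        return self.slowdown * baseline - self.ec_delay

    def record(self, cycles, ec_delay_cycles):
        """Account for `cycles` of execution, `ec_delay_cycles` of them due to EC."""
        self.elapsed += cycles
        self.ec_delay += ec_delay_cycles
        if self.slack() < 0:
            self.ec_enabled = False  # fall back to plain PL until the epoch ends
        if self.elapsed >= self.epoch_cycles:
            self.reset()
```

A controller would consult `ec_enabled` before applying any energy-conserving transition and otherwise behave exactly like the underlying PL policy.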
  • Slide 20
  • Evaluation Methodology: Single-core in-order CPU with an integrated memory controller. Simics plus a memory subsystem model; the OS and the physical mapping of virtual pages are both simulated. The memory system is driven by traces generated by Simics. Workloads: MediaBench, SPEC 2000, and client-server applications. Memory size: 512 MB. Performance is measured as the execution time of a trace file.
  • Slide 21
  • Performance vs. PL Policies: Knapsack and LRU-Ordered are best. (8 chips, 50% power budget.)
  • Slide 22
  • Energy vs. Policies: Compared with unrestricted execution. Unrestricted < budget for bzip? (Critique) (8 chips, 50% power budget.)
  • Slide 23
  • Performance vs. Budget: 8 chips under LRU-Ordered. Performance degradation is very small.
  • Slide 24
  • Energy vs. Power Budget: Savings decrease as the budget decreases. The trend is uniform across all workloads. (8 chips, 50% power budget.)
  • Slide 25
  • Performance for PL-EC-Perf: 8 chips, 25% power budget, LRU-Ordered, 3% slowdown. PD: an explicit energy-saving algorithm. PL and PL-EC-Perf perform almost no worse than PD. The exception is bzip2 (?).
  • Slide 26
  • Energy Saving for PL/PL-EC-Perf: 8 chips, 25% power budget, LRU-Ordered, 3% slowdown. PD: an explicit energy-saving algorithm. PL and PL-EC-Perf save more energy than PD. PD tends to send some chips to very deep states.
  • Slide 27
  • Conclusions: Four power-limiting (PL) policies are proposed. Performance degradation is surprisingly low. Power limiting is combined with energy conservation (PL-EC) and, further, with a performance guarantee (PL-EC-Perf). Limiting power consumption is as effective as doing energy conservation explicitly.
  • Slide 28
  • A Performance-Conserving Approach for Reducing Peak Power Consumption in Server Systems. Wes Felter, Karthick Rajamani, Tom Keller (IBM Austin Research Lab); Cosmin Rusu (University of Pittsburgh). Acknowledgments: The organization, order, and contents of some slides are based on Wes Felter's slides.
  • Slide 29
  • Motivations: System designers can no longer afford to provision for the peak power of all components (over-provisioning). Power overload and thermal violations cause system failures. The CPU is no longer the only major power consumer. The CPU and the main memory share the same power and cooling facilities.
  • Slide 30
  • Anti-Correlation of Processor and Memory Power: Processor and memory are not simultaneously highly utilized in workloads. Intuitively, the processor can't keep both itself and the memory busy.
  • Slide 31
  • Unconstrained System Power: In theory, the system can use 83 W.
  • Slide 32
  • What is This Paper About? Power shifting between processor and memory; the first paper to propose the concept of power shifting. A power estimation model based on activity counts. Three policies for power control at the server level: PLI, sliding window, and on-demand.
  • Slide 33
  • Processor Power Model: Power vs. dispatched instructions per cycle. 100K-cycle interval, 28 applications, linear regression.
  • Slide 34
  • Memory Power Model: Power vs. bandwidth. 100K-cycle interval, 28 applications, linear regression. $P_{mem} = \#ranks \times \#devices \times V_{DRAM} \times \left( (I_{active} - I_{idle}) \times \frac{BW}{BW_{peak}} + I_{idle} \right) + P_{others}$
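The memory power model on this slide can be evaluated directly. The sketch below is a plain transcription of the formula into a function; the parameter names are hypothetical, `devices` is assumed to mean devices per rank, and the numbers in the usage note are made up for illustration rather than taken from the paper.

```python
def memory_power(bw, peak_bw, ranks, devices, v_dram, i_active, i_idle, p_others):
    """Estimate memory power (W) from achieved bandwidth.

    bw, peak_bw: achieved and peak bandwidth in the same unit
    ranks, devices: number of ranks and devices (assumed per rank)
    v_dram: DRAM supply voltage in V
    i_active, i_idle: per-device current in A when fully busy and when idle
    p_others: power of everything else in the memory subsystem, in W
    """
    if peak_bw <= 0 or not 0 <= bw <= peak_bw:
        raise ValueError("bandwidth must lie between 0 and a positive peak")
    utilization = bw / peak_bw
    # Current interpolates linearly between idle and active with utilization.
    current = (i_active - i_idle) * utilization + i_idle
    return ranks * devices * v_dram * current + p_others
```

For example, with 4 ranks of 8 devices at 2.5 V, 0.2 A active and 0.05 A idle current, 2 W for other components, and half of peak bandwidth in use, the estimate comes to 12 W, halfway between the idle figure of 6 W and the full-bandwidth figure of 18 W.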
  • Slide 35
  • Power Actuators: Power consumption correlates strongly with activity. Activity regulation techniques: instruction-decode throttling; clock throttling (effective duty cycles); DVFS. For the processor: throttle at the instruction dispatch unit of the pipeline. For memory: limit the total number of memory requests.
  • Slide 36
  • System Architecture. [Diagram labels: System Power Controller; Processor Core with Dispatched Insns. Counter and Fetch Throttling; Memory Controller with Request Counter and Request Throttling; Memory; "Goes into powerdown mode when idle"; "Extensive clock gating"]
  • Slide 37
  • Key Ideas: Assume the number of activities in the next interval is the same as in the current one. Estimate power based on history. Allocate power based on the estimates. Power is split into activity-dependent power and standby power. The power allocation is enforced through thresholds on the number of activities.
  • Slide 38
  • Proportional-Last-Interval (PLI) Policy. Power estimation: $P_{CPU} = C_1 \cdot DPC_0 + C_2$ and $P_{mem} = M_1 \cdot BW_0 + M_2$. Power to be allocated: $P_{dynamic} = P_{budget} - C_2 - M_2$. Estimated active power: $P_{est} = DPC_0 \cdot C_1 + BW_0 \cdot M_1$. Power allocation ($DPC_1$ and $BW_1$ apply to the next interval): $DPC_1 = DPC_0 \cdot P_{dynamic} / P_{est}$ and $BW_1 = BW_0 \cdot P_{dynamic} / P_{est}$. Thresholds: $D_{th} = DPC_1 \cdot Period$ and $M_{th} = BW_1 \cdot Period$.
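The PLI computation on this slide can be followed step by step in a few lines. This sketch reads the power to be allocated as the budget minus the two standby terms C2 and M2 (an assumption about signs that seem to have been lost in the transcript), and it treats an interval with no observed activity as an error, which is a choice made here and not something the slide specifies. The function and parameter names are hypothetical.

```python
def pli_thresholds(dpc0, bw0, budget, c1, c2, m1, m2, period):
    """Return (d_th, m_th), the activity thresholds for the next interval.

    dpc0, bw0: dispatched instructions per cycle and memory bandwidth
               measured in the last interval
    c1, c2:    slope and standby term of the processor power model
    m1, m2:    slope and standby term of the memory power model
    period:    interval length in cycles
    """
    p_dynamic = budget - c2 - m2       # power left after standby power
    p_est = dpc0 * c1 + bw0 * m1       # activity-dependent power, last interval
    if p_est <= 0:
        raise ValueError("no activity observed in the last interval")
    scale = p_dynamic / p_est          # same proportion for both components
    dpc1 = dpc0 * scale
    bw1 = bw0 * scale
    return dpc1 * period, bw1 * period
```

Because both components are scaled by the same factor, halving the dynamic budget halves both thresholds; a factor above 1 simply means the last interval's activity already fit within the budget.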
  • Slide 39
  • Other Related Policies. Sliding window: a shorter interval is better for estimation accuracy, a larger one for reducing noise; the larger window spans 20 intervals. On-demand: no violations, no throttling; the interval should be small enough. Run-To-Exhaustion (RTE): power is monitored cycle by cycle and activity is throttled when the budget is violated; impractical, but it provides a point of comparison. Static: the budget is split in proportion to peak power.
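One way to picture the sliding-window trade-off is an estimator that keeps measuring at the short interval but averages over the last 20 samples. The slide does not say how the window is used, so the plain moving average below is an assumption, and the class name is hypothetical.

```python
from collections import deque


class SlidingWindowEstimator:
    """Moving average of per-interval activity over the most recent intervals."""

    def __init__(self, window=20):
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def update(self, activity):
        """Add the latest interval's activity and return the smoothed estimate."""
        self.samples.append(activity)
        return sum(self.samples) / len(self.samples)
```

The smoothed value could stand in for the last-interval measurement when estimating power for the next interval, trading some responsiveness for less noise.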
  • Slide 40
  • Simulation Environment: Traces from hardware (SPEC) and Mambo (e.g. JBB). Integrated simulation environment: Turandot + PowerTimer (Zhigang Hu et al., IBM T.J. Watson), a timing and power core model of a 2 GHz 970-like core with aggressive clock gating and a 512 KB L2 cache (power not simulated), plus MEMSIM, a timing and power DRAM model with 4 GB in 4 ranks of 128-bit 400 MHz DDR (PC3200). Both simulators are synchronized every cycle.
  • Slide 41
  • PLI vs. Static (1): Budget: 40 W. PLI is much better than static budgeting.
  • Slide 42
  • PLI vs. Static (2): Budget: 50 W. Average unconstrained power consumption < budget.
  • Slide 43
  • Policy Comparison: 100K-cycle interval, 40 W budget. On-demand is generally the best, at the cost of at least one interval of budget violation.
  • Slide 44
  • Normalized to RTE: On-demand and the sliding window are generally the best. On-demand is even better than RTE for art. It does not throttle activities proactively, at the cost of short violations.
  • Slide 45
  • Interval Size (PLI): ammp: highly variable even at small intervals and with steady power. A small interval fits variations better. In general, the best interval size is highly application-dependent.
  • Slide 46
  • Critiques. Paper 1: Lacks explanations for the behavior of different workloads (e.g. bzip). The examples used to illustrate the policies are not typical. Does not explain why the performance degradation is so surprisingly small. Paper 2: Open-loop, estimation-based scheme. Model verification. Power is not shifted but throttled. Very large performance degradation even when the budget is larger than the average power.
  • Slide 47
  • Comparison of the Two Papers (Limiting Power vs. Power Shifting)
      • Target: peak power of the DRAM system vs. peak power at the server level
      • Goal: peak power capping with energy conservation vs. peak power capping
      • Methodology: Knapsack optimization + heuristics vs. open-loop estimation
      • Solutions (both): a set of policies and their comparison
      • Experiments (both): simulation
      • Power budget (both): larger than the average
  • Slide 48
  • Thank you!