Simple Criticality Predictors for Dynamic Performance and Power Management in CMPs
Group Talk: Dec 10, 2008
Abhishek Bhattacharjee
Why Thread Criticality Prediction (TCP)?
Significant load imbalance in typical parallel programs: threads of a parallel program don't finish at the same time
Worse with heterogeneity (process variation, thermal emergencies, etc.)
Relative thread speed, or criticality, is difficult to predict
If we can deduce which thread is critical, and by how much, what can we do with this?
Performance Improvements from TCP
TCP to improve dynamic parallelism management performance (e.g., TBB): steal tasks from critical threads
Energy Efficiency from TCP
TCP-guided DVFS for barriers: slow down non-critical threads without affecting runtime
Our Goals
Develop low-overhead hardware TCP schemes
Harness counters and metrics available on-chip
Aim for high accuracy across architectures
TCP should be general across a variety of applications
Improve TBB performance with TCP-guided task stealing
Improve barrier energy efficiency with TCP-guided DVFS
Related Work
Thrifty Barrier [Li, Martinez, Huang, HPCA '05]
Transitions fast, non-critical threads into low-power sleep modes to save energy at barriers
Predicts barrier stall time based purely on history
Still wastes energy in the compute phase…

Meeting Points [Cai et al., PACT '08]
DVFS threads to save energy without performance penalty
Only applicable to parallel loops
Inserts meeting points to track loop iterations executed per core
Broadcasts iteration counts to all cores
Uses software to calculate appropriate DVFS settings
1. Unlike thrifty barrier, predict before the barrier is reached
2. Unlike meeting points, avoid specialized instructions & broadcasts
3. Unlike meeting points, broader applicability than parallel loops
4. Unlike both approaches, use software-independent criticality calculation
5. Target versatility: TBB performance, barrier energy waste, SMT priority schemes, memory priority schemes, etc.
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
Methodology: Simulators
TCP accuracy is harder on in-order pipelines
Energy savings are assessed accurately, with the OS running, on an emulator
Assumes a fixed leakage energy cost of 50% of baseline
Methodology: Benchmarks
Use larger, realistic data sets for Splash-2 [Bienia et al. PACT ’08]
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
Per-Core Fully-Local History
If behavior is sufficiently repetitive, can use fully-local history (thrifty barrier)
Difficult to achieve on in-order pipelines
Solution: target thread-comparative information
Rationale: the criticality of a single thread is determined relative to the other threads
Comparative Metric: Instruction Count
Normalize compute times and the metric against the critical thread
Comparative Metric: Instruction Count
Avg. error measured over barrier iterations, and over 10% execution snapshots of the non-barrier apps (Swaptions, Fluidanimate)
Poor accuracy across all tested benchmarks
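One plausible reading of this error metric (the slide does not spell it out): with thread c the critical, i.e. slowest, thread in an interval,

  Error(i) = | Metric(i) / Metric(c)  -  Time(i) / Time(c) |

averaged over the non-critical threads i and over all barrier intervals or snapshots, where Metric is the candidate criticality metric (here, instruction count) and Time is the measured compute time.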
Comparative Metric: Cache Statistics
But Ocean and LU still suffer from over 25% error
Include L1 I-Cache Misses …
Comparative Metric: Cache Statistics
Instruction counts and control flow affect LU, Water-Nsq, Water-Sp
But Ocean still has over 22% error
Include L2 Cache Misses …
Comparative Metric: Cache Statistics
Memory-intensive Ocean, Volrend, Radix, PARSEC particularly improved
Now check the weighted cache-miss metric on an out-of-order machine
Comparative Metric: Cache Statistics
Memory-intensive Ocean, Volrend, Radix, and PARSEC are the least impacted
The weighted cache-miss metric is the most accurate
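As an illustration of what such a weighted metric might look like (the exact weights are not given on this slide), the per-core criticality could be formed as

  Criticality(i) ≈ N_L1I(i) + N_L1D(i) + w_L2 × N_L2(i)

with per-core miss counts N and a weight w_L2 > 1 reflecting the much larger penalty of an L2 miss.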
Comparative Metric: Control Flow and TLB Misses
Control flow (branch misprediction) penalties and instruction count effects are tracked via L1 I-Cache misses
TLB misses: little variation among threads (they usually access closely spaced data)
Trivial to include a weighted TLB component if necessary
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
Basic TCP Hardware
Criticality counters placed at the shared, unified L2 cache
Simple, scalable hardware
Eliminates broadcasts, since the L2 controller sees all cache misses
Can accommodate more cache levels, a split L2, or distributed LLCs with trivial additional hardware and messages
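As a rough software sketch (not the actual RTL) of what the L2 controller does with these counters; the weights, core count, and function names below are illustrative assumptions:

  #include <array>
  #include <cstdint>

  enum class MissType { L1I, L1D, L2 };                 // miss classes visible to the L2 controller

  constexpr int      kNumCores = 16;                    // illustrative core count
  constexpr uint32_t kL2Weight = 10;                    // placeholder weight: L2 misses cost far more cycles

  std::array<uint32_t, kNumCores> criticality{};        // one criticality counter per core, kept at the L2

  // Invoked for every miss the shared L2 controller observes from 'core'; no broadcasts are
  // needed, since all misses already pass through this controller.
  void on_cache_miss(int core, MissType type) {
      if (type == MissType::L2)
          criticality[core] += kL2Weight;               // weighted component for L2 misses
      else
          criticality[core] += 1;                       // L1 I- and D-cache misses count equally here
  }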
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
The TBB Task Scheduler
Concurrency is expressed in tasks
The TBB dynamic scheduler stores and distributes tasks
The scheduler controls worker threads, each with a per-thread software queue
Threads try to extract tasks from their local queue
If the local queue is empty, threads steal tasks from remote queues
If a steal is unsuccessful, back off for a pre-determined time before retrying
Problem: the steal victim is chosen randomly, giving poor performance at higher core counts and under high load imbalance
Our TCP-Guided TBB Stealing Algorithm

If (cache miss from Core P) {
  Update criticality counter for Core P based on cache miss type
}
If (steal request from a core) {
  Scan all criticality counters to find the maximum value
  Report the core with the highest counter value as the steal victim
}
If (message indicating steal from victim Core P unsuccessful) {
  Reset criticality counter for Core P
}
If ((Number of Cycles % Interval Bound) == 0)
  Reset all criticality counters
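A compilable C++ analogue of the stealing-related steps above, written purely as software over the per-core counters; names such as pick_steal_victim and the 32-core size are illustrative, not part of TBB or of the proposed hardware interface:

  #include <algorithm>
  #include <array>
  #include <cstdint>

  constexpr int      kNumCores      = 32;
  constexpr uint64_t kIntervalBound = 100000;           // 100K-cycle reset interval (from the slides)

  std::array<uint32_t, kNumCores> criticality{};        // per-core criticality counters

  // Steal request: the core with the largest criticality counter is reported as the victim,
  // so idle threads relieve the most critical (slowest) thread of its queued tasks.
  int pick_steal_victim() {
      auto it = std::max_element(criticality.begin(), criticality.end());
      return static_cast<int>(it - criticality.begin());
  }

  // An unsuccessful steal from 'victim' resets its counter so it is not immediately re-picked.
  void on_steal_failed(int victim) { criticality[victim] = 0; }

  // Periodic reset keeps the counters tracking only recent behavior.
  void on_cycle(uint64_t cycle) {
      if (cycle % kIntervalBound == 0) criticality.fill(0);
  }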
Hardware Details
Interval Bound = 100K cycles
Simple and scalable hardware: even at 64 cores with 14 bits per criticality counter, only 114 bytes of storage
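As a quick check on that figure: the counters alone account for 64 × 14 bits = 896 bits = 112 bytes, so the quoted 114 bytes presumably also folds in a small amount of bookkeeping state (e.g., the interval counter), which the slide does not break out.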
Minimal message overhead
A TCP access has the same latency as an L2 cache miss
Steal Rate Improvements with TCP
• TCP-guided task stealing limits false negatives to under 7%
• Now, at higher core counts, steal success rates are greater
Performance Improvements with TCP
• Up to 32% performance gains over random stealing
• Regularly outperforms occupancy-based stealing
• Streamcluster is highly load imbalanced; TCP and occupancy-based benefits are similar
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
TCP-Driven Energy Efficiency in Barrier-Based Parallel Programs
Goals recap:
Multiple threads reach barriers at different times
Use TCP to predict thread speeds; DVFS fast threads down to lower frequencies
Assume four frequency settings: f0, 0.85f0, 0.70f0, 0.55f0
Aim: energy efficiency with no performance penalty
But: TCP mispredictions arise from spurious program behavior, and DVFS transition overhead impact should be minimized
TCP-Driven DVFS Hardware
On a cache miss from core P, update criticality counter P
Is P running at f0, and is its counter above the threshold T?
If so, for all cores, find the closest SST match to each criticality counter
Is the SST suggesting a frequency switch?
If the SST suggests a switch, increment that setting's SCT (Suggestion Confidence Table) counter and decrement the others; if the counter for the new DVFS setting is at its maximum, perform the actual DVFS switch
If the SST does not suggest a switch, increment the current setting's SCT counter and decrement the others
Finally, reset the criticality counters
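A rough, compilable sketch of this decision flow, assuming four frequency settings and 2-bit saturating SCT counters; the SST contents and all names here are placeholders for illustration rather than the actual table values:

  #include <array>
  #include <cstdint>

  constexpr int kNumCores = 16;
  constexpr int kNumFreqs = 4;                          // f0, 0.85*f0, 0.70*f0, 0.55*f0
  constexpr int kSctMax   = 3;                          // 2-bit saturating confidence counters

  // Illustrative SST: an "expected" criticality counter value for each frequency setting.
  constexpr std::array<uint32_t, kNumFreqs> kSst = {0, 200, 400, 600};

  std::array<uint32_t, kNumCores> criticality{};        // per-core criticality counters
  std::array<int, kNumCores>      freq_setting{};       // current DVFS index per core (0 = f0)
  std::array<std::array<int, kNumFreqs>, kNumCores> sct{}; // per-core Suggestion Confidence Table

  uint32_t dist(uint32_t a, uint32_t b) { return a > b ? a - b : b - a; }

  // Closest SST entry to a given criticality counter value.
  int closest_sst_match(uint32_t crit) {
      int best = 0;
      for (int f = 1; f < kNumFreqs; ++f)
          if (dist(kSst[f], crit) < dist(kSst[best], crit)) best = f;
      return best;
  }

  // Triggered when a core running at f0 pushes its counter above the threshold T.
  void evaluate_dvfs() {
      for (int c = 0; c < kNumCores; ++c) {
          int suggested = closest_sst_match(criticality[c]);
          for (int f = 0; f < kNumFreqs; ++f) {          // bump the suggested entry, decay the others
              if (f == suggested) { if (sct[c][f] < kSctMax) ++sct[c][f]; }
              else                { if (sct[c][f] > 0)       --sct[c][f]; }
          }
          if (suggested != freq_setting[c] && sct[c][suggested] == kSctMax)
              freq_setting[c] = suggested;               // confidence saturated: apply the switch
          criticality[c] = 0;                            // reset the criticality counters afterwards
      }
  }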
Criticality Counter Threshold
In-order pipeline results (the harder case), 16 cores
Balance between TCP speed and accuracy
Avg. Accuracy @ 1024 = 78.19 %
Integrating the Suggestion Conf. Table
Avg. Accuracy @ 2 bits = 92.68 %
Case Study: Impact of SCT on Streamcluster
Impact of Memory Parallelism
Gradual DVFS
Might not save as much energy as direct DVFS at low MSHR counts, but at high MSHR counts gives much higher accuracy than direct DVFS
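A minimal sketch of one reading of "gradual": step the frequency one level toward the suggested setting per evaluation instead of jumping directly, so a single misprediction costs at most one step.

  // Gradual DVFS: move one level toward the suggested setting per evaluation interval.
  int gradual_step(int current, int suggested) {
      if (suggested > current) return current + 1;       // slow down by one level
      if (suggested < current) return current - 1;       // speed back up by one level
      return current;                                    // already at the suggested setting
  }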
TCP-Guided DVFS Scheme Performance
Energy Savings
FPGA platform with 4 cores, 50% fixed leakage cost
Even higher savings expected with greater core counts, core complexity, realistic leakage modeling (temperature impact), and on-chip switching regulators
Hardware Characteristics
Based on readily available on-chip cache statistics
All thread criticality calculation is done in hardware with the SST
Minimal network messages
Low overhead and scalable:
16-core CMP: 71 bytes of storage
64-core CMP: 215 bytes of storage
Conclusion
Low-overhead TCPs help manage parallelism for both energy and performance
Accurate TCPs can be based on simple cache statistics
A TCP-based TBB task stealer offers 12.9% to 32% performance improvements on a 32-core CMP
TCP-based DVFS offers 15% energy savings on a 4-core CMP
Future work: TLB prefetching, DRAM scheduling