Simple Criticality Predictors for Dynamic Performance and Power Management in CMPs
Group Talk: Dec 10, 2008
Abhishek Bhattacharjee
Why Thread Criticality Prediction (TCP)?
Significant load imbalance in typical parallel programs: threads of a parallel program don't finish at the same time
Worse with heterogeneity (process variation, thermal emergencies, etc.)
Relative thread speed, or criticality, is difficult to predict
If we can deduce which thread is critical, and by how much, what can we do with this?
Performance Improvements from TCP
TCP to improve dynamic parallelism management performance (e.g., TBB): steal tasks from critical threads
Energy Efficiency from TCP
TCP-guided DVFS for barriers: slow down non-critical threads without affecting runtime
Our Goals
Develop low-overhead hardware TCP schemes
Harness counters and metrics available on-chip
Aim for high accuracy across architectures
TCP should be general across a variety of applications
Improve TBB performance with TCP-guided task stealing
Improve barrier energy efficiency with TCP-guided DVFS
Related Work
Thrifty Barrier [Li, Martinez, Huang, HPCA '05]
Transitions fast, non-critical threads into low-power sleep modes to save energy at barriers
Predicts barrier stall time based purely on history
Still wastes energy in the compute phase…

Meeting Points [Cai et al., PACT '08]
DVFS threads to save energy without performance penalty
Only applicable to parallel loops
Inserts meeting points to track loop iterations executed per core
Broadcasts iteration counts to all cores
Uses software to calculate appropriate DVFS settings
1. Unlike thrifty barrier, predict before the barrier is reached
2. Unlike meeting points, avoid specialized instructions & broadcasts
3. Unlike meeting points, broader applicability than parallel loops
4. Unlike both approaches, use software-independent criticality calculation
5. Target versatility: TBB performance, barrier energy waste, SMT priority schemes, memory priority schemes, etc.
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
Methodology: Simulators
TCP accuracy is harder on in-order pipelines
Energy savings are assessed accurately, with the OS running, on an emulator
Assumes a fixed leakage energy cost of 50% of baseline
Methodology: Benchmarks
Use larger, realistic data sets for Splash-2 [Bienia et al. PACT ’08]
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
Per-Core Fully-Local History
If behavior is sufficiently repetitive, can use fully-local history (thrifty barrier)
Difficult to achieve on in-order pipelines
Solution: target thread-comparative information
Rationale: the criticality of a single thread is determined relative to the other threads
Comparative Metric: Instruction Count
Normalize compute times and the metric against the critical thread
Comparative Metric: Instruction Count
Avg. error measured over barrier iterations, and over 10% execution snapshots of the non-barrier apps (Swaptions, Fluidanimate)
Poor accuracy across all tested benchmarks
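One plausible reading of this error metric (the slide does not spell it out): with thread c the critical, i.e. slowest, thread in an interval,

  Error(i) = | Metric(i) / Metric(c)  -  Time(i) / Time(c) |

averaged over the non-critical threads i and over all barrier intervals or snapshots, where Metric is the candidate criticality metric (here, instruction count) and Time is the measured compute time.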
Comparative Metric: Cache Statistics
But Ocean and LU still suffer from over 25% error
Include L1 I-Cache Misses …
Comparative Metric: Cache Statistics
Instruction counts and control flow affect LU, Water-Nsq, Water-Sp
But Ocean still has over 22% error
Include L2 Cache Misses …
Comparative Metric: Cache Statistics
Memory-intensive Ocean, Volrend, Radix, PARSEC particularly improved
Now check the weighted cache-miss metric on an out-of-order machine
Comparative Metric: Cache Statistics
Memory-intensive Ocean, Volrend, Radix, and PARSEC are the least impacted
The weighted cache-miss metric is the most accurate
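As an illustration of what such a weighted metric might look like (the exact weights are not given on this slide), the per-core criticality could be formed as

  Criticality(i) ≈ N_L1I(i) + N_L1D(i) + w_L2 × N_L2(i)

with per-core miss counts N and a weight w_L2 > 1 reflecting the much larger penalty of an L2 miss.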
Comparative Metric: Control Flow and TLB Misses
Control flow (branch misprediction) penalties and instruction count effects are tracked via L1 I-Cache misses
TLB misses: little variation among threads (they usually access closely spaced data)
Trivial to include a weighted TLB component if necessary
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
Basic TCP Hardware
Criticality counters placed at the shared, unified L2 cache
Simple, scalable hardware
Eliminates broadcasts, since the L2 controller sees all cache misses
Can accommodate more cache levels, a split L2, or distributed LLCs with trivial additional hardware and messages
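As a rough software sketch (not the actual RTL) of what the L2 controller does with these counters; the weights, core count, and function names below are illustrative assumptions:

  #include <array>
  #include <cstdint>

  enum class MissType { L1I, L1D, L2 };                 // miss classes visible to the L2 controller

  constexpr int      kNumCores = 16;                    // illustrative core count
  constexpr uint32_t kL2Weight = 10;                    // placeholder weight: L2 misses cost far more cycles

  std::array<uint32_t, kNumCores> criticality{};        // one criticality counter per core, kept at the L2

  // Invoked for every miss the shared L2 controller observes from 'core'; no broadcasts are
  // needed, since all misses already pass through this controller.
  void on_cache_miss(int core, MissType type) {
      if (type == MissType::L2)
          criticality[core] += kL2Weight;               // weighted component for L2 misses
      else
          criticality[core] += 1;                       // L1 I- and D-cache misses count equally here
  }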
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
The TBB Task Scheduler
Concurrency is expressed in tasks
The TBB dynamic scheduler stores and distributes tasks
The scheduler controls worker threads, each with a per-thread software queue
Threads try to extract tasks from their local queue
If the local queue is empty, threads steal tasks from remote queues
If a steal is unsuccessful, back off for a pre-determined time before retrying
Problem: the steal victim is chosen randomly, giving poor performance at higher core counts and under high load imbalance
Our TCP-Guided TBB Stealing Algorithm

If (cache miss from Core P) {
  Update criticality counter for Core P based on cache miss type
}
If (steal request from a core) {
  Scan all criticality counters to find the maximum value
  Report the core with the highest counter value as the steal victim
}
If (message indicating steal from victim Core P unsuccessful) {
  Reset criticality counter for Core P
}
If ((Number of Cycles % Interval Bound) == 0)
  Reset all criticality counters
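A compilable C++ analogue of the stealing-related steps above, written purely as software over the per-core counters; names such as pick_steal_victim and the 32-core size are illustrative, not part of TBB or of the proposed hardware interface:

  #include <algorithm>
  #include <array>
  #include <cstdint>

  constexpr int      kNumCores      = 32;
  constexpr uint64_t kIntervalBound = 100000;           // 100K-cycle reset interval (from the slides)

  std::array<uint32_t, kNumCores> criticality{};        // per-core criticality counters

  // Steal request: the core with the largest criticality counter is reported as the victim,
  // so idle threads relieve the most critical (slowest) thread of its queued tasks.
  int pick_steal_victim() {
      auto it = std::max_element(criticality.begin(), criticality.end());
      return static_cast<int>(it - criticality.begin());
  }

  // An unsuccessful steal from 'victim' resets its counter so it is not immediately re-picked.
  void on_steal_failed(int victim) { criticality[victim] = 0; }

  // Periodic reset keeps the counters tracking only recent behavior.
  void on_cycle(uint64_t cycle) {
      if (cycle % kIntervalBound == 0) criticality.fill(0);
  }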
Hardware Details
Interval Bound = 100K cycles
Simple and scalable hardware: even at 64 cores with 14 bits per criticality counter, only 114 bytes of storage
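As a quick check on that figure: the counters alone account for 64 × 14 bits = 896 bits = 112 bytes, so the quoted 114 bytes presumably also folds in a small amount of bookkeeping state (e.g., the interval counter), which the slide does not break out.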
Minimal message overhead
A TCP access has the same latency as an L2 cache miss
Steal Rate Improvements with TCP
• TCP-guided task stealing limits false negatives to under 7%
• Now, at higher core counts, steal success rates are greater
Performance Improvements with TCP
• Up to 32% performance gains over random stealing
• Regularly outperforms occupancy-based stealing
• Streamcluster is highly load imbalanced; TCP and occupancy-based benefits are similar
Outline
Methodology
Predicting Thread Criticality
Basic TCP Hardware
Improving TBB Performance with TCPs
Minimizing Energy Waste at Barriers with TCPs
TCP-Driven Energy Efficiency in Barrier-Based Parallel Programs
Goals recap:
Multiple threads reach barriers at different times
Use TCP to predict thread speeds; DVFS fast threads down to lower frequencies
Assume four frequency settings: f0, 0.85f0, 0.70f0, 0.55f0
Aim: energy efficiency with no performance penalty
But: TCP mispredictions arise from spurious program behavior, and DVFS transition overhead impact should be minimized
TCP-Driven DVFS Hardware
On a cache miss from core P, update criticality counter P
Is P running at f0, and is its counter above the threshold T?
If so, for all cores, find the closest SST match to each criticality counter
Is the SST suggesting a frequency switch?
If the SST suggests a switch, increment that setting's SCT (Suggestion Confidence Table) counter and decrement the others; if the counter for the new DVFS setting is at its maximum, perform the actual DVFS switch
If the SST does not suggest a switch, increment the current setting's SCT counter and decrement the others
Finally, reset the criticality counters
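A rough, compilable sketch of this decision flow, assuming four frequency settings and 2-bit saturating SCT counters; the SST contents and all names here are placeholders for illustration rather than the actual table values:

  #include <array>
  #include <cstdint>

  constexpr int kNumCores = 16;
  constexpr int kNumFreqs = 4;                          // f0, 0.85*f0, 0.70*f0, 0.55*f0
  constexpr int kSctMax   = 3;                          // 2-bit saturating confidence counters

  // Illustrative SST: an "expected" criticality counter value for each frequency setting.
  constexpr std::array<uint32_t, kNumFreqs> kSst = {0, 200, 400, 600};

  std::array<uint32_t, kNumCores> criticality{};        // per-core criticality counters
  std::array<int, kNumCores>      freq_setting{};       // current DVFS index per core (0 = f0)
  std::array<std::array<int, kNumFreqs>, kNumCores> sct{}; // per-core Suggestion Confidence Table

  uint32_t dist(uint32_t a, uint32_t b) { return a > b ? a - b : b - a; }

  // Closest SST entry to a given criticality counter value.
  int closest_sst_match(uint32_t crit) {
      int best = 0;
      for (int f = 1; f < kNumFreqs; ++f)
          if (dist(kSst[f], crit) < dist(kSst[best], crit)) best = f;
      return best;
  }

  // Triggered when a core running at f0 pushes its counter above the threshold T.
  void evaluate_dvfs() {
      for (int c = 0; c < kNumCores; ++c) {
          int suggested = closest_sst_match(criticality[c]);
          for (int f = 0; f < kNumFreqs; ++f) {          // bump the suggested entry, decay the others
              if (f == suggested) { if (sct[c][f] < kSctMax) ++sct[c][f]; }
              else                { if (sct[c][f] > 0)       --sct[c][f]; }
          }
          if (suggested != freq_setting[c] && sct[c][suggested] == kSctMax)
              freq_setting[c] = suggested;               // confidence saturated: apply the switch
          criticality[c] = 0;                            // reset the criticality counters afterwards
      }
  }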
Criticality Counter Threshold
In-order pipeline results (the harder case), 16 cores
Balance between TCP speed and accuracy
Avg. Accuracy @ 1024 = 78.19 %
Integrating the Suggestion Conf. Table
Avg. Accuracy @ 2 bits = 92.68 %
Case Study: Impact of SCT on Streamcluster
Impact of Memory Parallelism
Gradual DVFS
Might not save as much energy as direct DVFS at low MSHR counts, but at high MSHR counts gives much higher accuracy than direct DVFS
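A minimal sketch of one reading of "gradual": step the frequency one level toward the suggested setting per evaluation instead of jumping directly, so a single misprediction costs at most one step.

  // Gradual DVFS: move one level toward the suggested setting per evaluation interval.
  int gradual_step(int current, int suggested) {
      if (suggested > current) return current + 1;       // slow down by one level
      if (suggested < current) return current - 1;       // speed back up by one level
      return current;                                    // already at the suggested setting
  }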
TCP-Guided DVFS Scheme Performance
Energy Savings
FPGA platform with 4 cores, 50% fixed leakage cost
Even higher savings expected with greater core counts, core complexity, realistic leakage modeling (temperature impact), and on-chip switching regulators
Hardware Characteristics
Based on readily available on-chip cache statistics
All thread criticality calculation is done in hardware with the SST
Minimal network messages
Low overhead and scalable:
16-core CMP: 71 bytes of storage
64-core CMP: 215 bytes of storage
Conclusion
Low-overhead TCPs help manage parallelism for both energy and performance
Accurate TCPs can be based on simple cache statistics
A TCP-based TBB task stealer offers 12.9% to 32% performance improvements on a 32-core CMP
TCP-based DVFS offers 15% energy savings on a 4-core CMP
Future work: TLB prefetching, DRAM scheduling