The Salishan Conference on High-Speed Computing

No Free Lunch, No Hidden Cost
How Can Co-Design Help?

X. Sharon Hu
Dept. of Computer Science and Engineering, University of Notre Dame
Theme: Exposing Hidden Execution Costs

Costs of execution, in both performance and power:
- Computation
- Communication
- Data motion
- Synchronization
- ...

How can we strike a balance between the extremes? Hide as much as possible, or explicitly manage "all" costs?

My "position": expose widely and choose wisely, with a focus on power.
Why Take This Position?

Expose widely:
- Better understanding of the contribution of each component
- Allows application-specific tradeoffs
- Provides opportunities for powerful co-design tools

Choose wisely:
- Requires sophisticated co-design tools
- Explores more algorithm/software options
But Easier Said Than Done!

Heterogeneity:
- Compute nodes: (multi-core) CPU, GP-GPU, FPGA, ...
- Memory components: on-chip, on-board, disks, ...
- Communication infrastructure: bus, NoC, networks, ...

Parallelism ("non-determinism"):
- Data access: movement, coherence, ...
- Resource contention
- Synchronization
Outline
Why expose widely?
How to benefit from exposing widely?
How to choose wisely?
Going forward
Why Expose Widely? (1)

Different programs have different power distributions.

[Figure: GPU power distribution (NVIDIA GTX 280), broken down into memory, constant SM, constant cache, texture cache, and GPU cores; Hong and Kim, ISCA 2010]
Why Expose Widely? (2)

Data movement impacts different algorithms differently.

[Figure: energy consumption of three sorting algorithms (Pentium 4 + GeForce 570)]
Why Expose Widely? (3)

Contention effects are application dependent.

[Figure: performance degradation due to memory bus contention; M. Kondo et al., SIGARCH 2007]
How to Benefit from "Exposing Widely"?

Co-design is the key: expose all factors impacting the "execution model":
- Computation: processing resources
- Data motion: memory components and hierarchy
- Communication: bus and network
- Resource contention, synchronization, ...

Some examples:
- Software macromodeling
- Hardware module-based modeling
- Optimization through power management

Keep in mind Amdahl's law.
Macromodeling: Algorithm-Complexity Based

Relate the power/energy of a program to its complexity.

Example: E = C1*S + C2*S^2 + C3*S^3 (Tan et al., DAC'01), where S is the size of the array for a sorting algorithm.

Example: Ecomm = C0 + C1*S (Loghi et al., ACM TECS'07), where S is the size of the exchanged messages.

More sophisticated models account for both computation and communication. But how to handle resource contention?
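As a concrete illustration, the two macromodels above can be evaluated directly once their coefficients are known. A minimal Python sketch; the coefficient values below are illustrative placeholders, not measurements (in practice they are fit per platform from a few profiled runs):

```python
# Evaluating the two macromodels above. The coefficients are illustrative
# placeholders: in practice they are fit per platform from a few profiled
# runs of the algorithm at different input sizes.

def sort_energy(S, c1=2.0e-9, c2=3.0e-12, c3=1.0e-15):
    """Computation macromodel E = C1*S + C2*S^2 + C3*S^3 (Tan et al. style)."""
    return c1 * S + c2 * S**2 + c3 * S**3

def comm_energy(S, c0=5.0e-6, c1=8.0e-9):
    """Communication macromodel Ecomm = C0 + C1*S (Loghi et al. style)."""
    return c0 + c1 * S

# Estimate: sort S elements, then exchange the result as one message
S = 100_000
print(sort_energy(S) + comm_energy(S))   # joules, under the assumed coefficients
```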
Power Modeling of Bus Contention
Penolazzi, Sander and Hemani, DATE'11

Characterization step:
- C%N,1: percentage cycle difference between the N-processor case and the 1-processor case
- Can be done by IP providers on chosen benchmarks

Prediction step:
- t(N) = t(1) * (1 + C%N,1): the cycle count when N processors share the bus
- The added stall cycles, N_stall = t(N) - t(1), are charged at a per-cycle energy E_cycle
- Total energy combines the active energy E_a with the idle energy E_idle accumulated while stalled
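A minimal sketch of the two-step flow, assuming the characterized constant C%N,1 is given. Both the shape of the energy term and all constants below are illustrative assumptions, not values from the paper:

```python
# Two-step bus-contention model sketch (after Penolazzi, Sander and Hemani,
# DATE'11). Characterization: c_pct is the fractional cycle increase of the
# N-processor run over the 1-processor run, measured on benchmarks (e.g. by
# the IP provider). Prediction: scale the uncontended cycle count and charge
# the added stall cycles at a per-cycle energy. All numbers are illustrative.

def predict_cycles(t1_cycles, c_pct):
    """t(N) = t(1) * (1 + C%N,1): cycles when N processors share the bus."""
    return t1_cycles * (1.0 + c_pct)

def predict_energy(t1_cycles, c_pct, e_active, e_stall_cycle):
    """Active energy of the uncontended run plus energy of the stall cycles."""
    n_stall = predict_cycles(t1_cycles, c_pct) - t1_cycles
    return e_active + n_stall * e_stall_cycle

t1 = 1_000_000                       # cycles measured with the bus uncontended
c_pct = 0.25                         # characterized slowdown, e.g. 4 processors
print(predict_cycles(t1, c_pct))     # cycle estimate under contention
print(predict_energy(t1, c_pct, e_active=0.5, e_stall_cycle=1e-9))
```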
Hierarchical Module-Based Power Modeling

Accumulate the energy/power of individual modules. CPU+GPU example:
- Access rate: software dependent
- Data movement contributes to memory power
- Resource contention modifies access rates

P(Mi) = AccessRate(Mi) * ArchScaling(Mi) * MaxP(Mi) + NonGatedP(Mi)
P_total = sum_i Util(Mi) * P(Mi) + P_other
P_total = P_CPU + P_GPU + P_mem = sum_i P(Mi) + P_idle

Adapted from Isci and Martonosi, Micro'03
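The accumulation above can be sketched in a few lines; the module names, access rates, and wattages below are invented for illustration:

```python
# Hierarchical module-based accumulation sketch (in the style of Isci and
# Martonosi, Micro'03): each module's power scales with its software-dependent
# access rate plus a non-gated (always-on) term, and subsystem powers sum
# to the total. Module names and all numbers are illustrative.

def module_power(access_rate, arch_scaling, max_power, non_gated):
    """P(Mi) = AccessRate(Mi) * ArchScaling(Mi) * MaxP(Mi) + NonGatedP(Mi)."""
    return access_rate * arch_scaling * max_power + non_gated

modules = {
    # name: (access_rate, arch_scaling, max_power_W, non_gated_W)
    "cpu_alu":   (0.60, 1.0, 8.0, 0.5),
    "gpu_cores": (0.40, 1.0, 90.0, 4.0),
    "dram":      (0.25, 1.0, 12.0, 1.0),   # data movement shows up here
}

p_idle = 10.0                              # platform idle power (W)
p_total = p_idle + sum(module_power(*m) for m in modules.values())
print(f"P_total = {p_total:.1f} W")
```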
Managing Bus Contention to Reduce Energy
M. Kondo, H. Sasaki and H. Nakamura, 2006

- A counter for memory requests
- A register for PU identification
- Thresholds for selecting which PU uses which Vdd value
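One way the counter/threshold mechanism might be wired up, as a sketch. The policy direction shown (running memory-bound PUs, which mostly stall on the contended bus anyway, at lower Vdd), the thresholds, and the voltage levels are all assumptions for illustration, not taken from the paper:

```python
# Sketch of the counter/threshold mechanism above (after Kondo, Sasaki and
# Nakamura). Each processing unit (PU) keeps a memory-request counter; a
# threshold then selects the Vdd level per PU. The policy shown, slowing
# PUs that mostly stall on the contended bus, is one plausible choice;
# thresholds and voltage levels are illustrative.

VDD_HIGH, VDD_LOW = 1.2, 0.9   # supply-voltage levels (V), illustrative
THRESHOLD = 1000               # memory requests per control interval

def assign_vdd(mem_requests_by_pu):
    """Map each PU id to a Vdd level based on its memory-request counter."""
    return {pu: (VDD_LOW if requests >= THRESHOLD else VDD_HIGH)
            for pu, requests in mem_requests_by_pu.items()}

counters = {0: 2500, 1: 300, 2: 1400, 3: 50}   # per-PU counters, one interval
print(assign_vdd(counters))
```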
Application Mapping to Reduce Energy (1)

Application mapping for heterogeneous systems.

[Figure: jobs J1-J4, each annotated with ([minRi, maxRi], Di), mapped onto PEs 1-4 connected to a shared memory]

R. Racu, R. Ernst, A. Hamann, B. Mochocki and X. Hu, "Methods for power optimization in distributed embedded systems with real-time requirements," CASES'06.
Application Mapping to Reduce Energy (2)

Optimization:
- Minimize power/energy dissipation
- Satisfy timing properties (e.g., average path latency, average lateness, etc.)
- ...

Search space:
- Scheduling parameters, traffic shaping, ...
- Task-level DVFS, i.e., task speed assignment
- Resource-level DVFS, i.e., resource speed assignment
- ...
Application Mapping (3): Sensitivity Analysis

[Figure: sensitivity analysis results; R. Racu, R. Ernst, A. Hamann, B. Mochocki and X. Hu, "Methods for power optimization in distributed embedded systems with real-time requirements," CASES'06]
Application Mapping (4): GA-Based Approach

[Figure: GA-based optimization loop in which candidate mappings are scheduled (2'. scheduling trace) and evaluated by a PowerAnalyzer (3'. power dissipation)]

A power model is needed.
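The loop can be sketched as a toy genetic algorithm: a chromosome is a task-to-PE mapping, a crude per-PE serialization stands in for the scheduling trace, and a simple energy sum plays the role of the PowerAnalyzer. Every number (costs, deadline, GA parameters) is an illustrative placeholder:

```python
# Toy GA over task-to-PE mappings. A chromosome is a mapping; a crude
# per-PE serialization stands in for the "scheduling trace" and an energy
# sum for the "PowerAnalyzer". Costs, deadline and GA parameters are all
# illustrative placeholders.
import random

random.seed(0)
N_TASKS = 6
PES = [0, 1, 2, 3]
EXEC_TIME = [[random.uniform(1.0, 5.0) for _ in PES] for _ in range(N_TASKS)]
POWER = [[random.uniform(0.5, 2.0) for _ in PES] for _ in range(N_TASKS)]
DEADLINE = 12.0

def fitness(mapping):
    """Energy of a mapping, with a large penalty for missing the deadline."""
    pe_busy = {pe: 0.0 for pe in PES}
    energy = 0.0
    for task, pe in enumerate(mapping):
        pe_busy[pe] += EXEC_TIME[task][pe]   # tasks on one PE serialize
        energy += POWER[task][pe] * EXEC_TIME[task][pe]
    penalty = 1e6 if max(pe_busy.values()) > DEADLINE else 0.0
    return energy + penalty

def evolve(pop_size=20, generations=30, mutation=0.2):
    pop = [[random.choice(PES) for _ in range(N_TASKS)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        pop = pop[:pop_size // 2]                  # keep the fitter half
        while len(pop) < pop_size:
            a, b = random.sample(pop[:pop_size // 2], 2)
            cut = random.randrange(1, N_TASKS)     # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation:
                child[random.randrange(N_TASKS)] = random.choice(PES)
            pop.append(child)
    return min(pop, key=fitness)

best = evolve()
print("best mapping:", best, "fitness:", round(fitness(best), 2))
```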
Going Forward: A Systematic Co-Design Effort

Expose more:
- More hardware counters/registers
- More efficient/accurate high-level power models
- Better models for resource contention and synchronization

Choose better:
- Handling parallelism: algorithms, OS, hardware; resource contention; synchronization
- Handling non-determinism: worst-case bounds, statistical analysis, interval-based techniques
ES Design vs. HPCS Design

Differences (maybe):
- Application-specific workloads vs. domain-specific workloads
- Constraints, objectives, desirables? Latency, throughput, energy, cost, reliability, fault tolerance, IP protection/privacy, ToM, ...
- Other issues: homogeneous vs. heterogeneous, levels of complexity, user expertise, ...

Similarities:
- Ever-increasing hardware capability: multi-core, multi-thread, complex communication fabrics, memory hierarchy, ...
- Productivity gap
- Common concerns: latency, throughput, energy, cost, reliability, fault tolerance, ...
Leverage Co-Design for HPC

Systematic performance estimation:
- Formal methods: scenario-based, statistical analysis
- Hybrid approaches: analytical + simulation
- Seamless migration from one abstraction level to the next

Efficient design space exploration:
- Efficient search techniques
- Multiple-level abstraction models
- Multiple-attribute optimization
- Others: memory and communication analysis and design